System and method for genomic association

ABSTRACT

In variants, a method for genomic association can include: determining observed variable values and observed phenotype values for each organism in a population, removing information from variables of interest, determining a phenotype-variable association model, identifying causal variables associated with a phenotype, and/or any other suitable steps.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 18/119,030 filed 8 Mar. 2023, which claims the benefit of U.S. Provisional Application No. 63/317,656 filed 8 Mar. 2022, U.S. Provisional Application No. 63/325,831 filed 31 Mar. 2022, U.S. Provisional Application No. 63/350,326 filed 8 Jun. 2022, and U.S. Provisional Application No. 63/350,328 filed 8 Jun. 2022, each of which is incorporated in its entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the genomic field, and more specifically to a new and useful system and method in the genomic field.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a schematic representation of a variant of the method.

FIG. 2 depicts an example of selecting a subset of variables and determining a variable window within the subset of variables.

FIG. 3 depicts an example of selecting a subset of variables, including clustering k-mer variables.

FIG. 4 depicts an example of determining a variable window based on variable analysis parameters.

FIGS. 5A and 5B depict examples of iteratively determining a variable window.

FIG. 6 depicts an example of test variable generation using a linear regression variable-variable association model.

FIG. 7A depicts an example of training a variable-variable association model.

FIG. 7B depicts an example of determining a set of test variables using a trained variable-variable association model.

FIG. 8 depicts an example of determining a transformed test variable.

FIG. 9 depicts an example of determining a test variable using a process model.

FIG. 10 depicts an example of determining a test variable using a variable value distribution.

FIGS. 11A, 11B, and 11C depict illustrative examples of an observed data set, a test data set with information for one variable replaced, and a test data set with information for all variables replaced, respectively.

FIG. 12 depicts a first example of determining an association metric.

FIG. 13 depicts a second example of determining an association metric.

FIG. 14 depicts a third example of determining an association metric.

FIG. 15 depicts a fourth example of determining an association metric.

FIG. 16 depicts an example of determining an association metric for a variable based on multiple test model metrics for the variable.

FIG. 17 depicts an illustrative example of association metrics.

FIGS. 18A and 18B depict illustrative examples of identifying causal variables.

FIG. 19 depicts an example of identifying causal variables.

FIGS. 20 and 21 depict a first and second variant of determining a test summary statistic.

FIG. 22 is a schematic representation of a variant of the method for determining a set of breeding parameters.

FIGS. 23A, 23B, and 23C depict examples of determining a phenotype model.

FIG. 24A depicts a first example of determining a target causal variable value set.

FIG. 24B depicts a second example of determining a target causal variable value set.

FIG. 25 is a schematic representation of a variant of determining breeding parameters.

FIG. 26 is an illustrative example of a phenotype value distribution for the descendants of a plurality of parent sets.

DETAILED DESCRIPTION

The following description of the embodiments of the invention is not intended to limit the invention to these embodiments, but rather to enable any person skilled in the art to make and use this invention.

1. Overview

As shown in FIG. 1, the method can include: determining observed values for variables and phenotypes for each organism in a population S100, removing information from variables of interest S300, determining a phenotype-variable association model S500, identifying causal variables associated with a phenotype S600, and/or any other suitable steps. In variants, the method can function to identify genomic components, environmental parameters, and/or other variables linked to (e.g., causing or influencing) a target phenotype.

2. Examples

In an example, the method for determining a variable's association with a phenotype can include: observing a genotype and phenotype for each organism in a population; aggregating the phenotype values from each organism into an observed phenotype (P) (e.g., a vector of the phenotype values); and determining an observed variable (V_(i)) for each of a set of variables (e.g., a vector of the genotype values and/or optionally other variable values such as environmental parameters, methylation, etc.). In this example, each observed variable can be a vector of observed values (e.g., a vector of alleles, a vector of k-mers, a vector of RNA transcription amounts, etc.) for the respective variable, constructed across the population of organisms (e.g., wherein the observed variable and observed phenotype have the same organism ordering). In variants, the method can optionally include selecting a subset of variables, wherein the remainder of the method can be limited to analysis of the variables corresponding to the variable subset; alternatively, the method can be performed for all variables.

The influence of each variable on the phenotype can then be determined by: generating a test variable (e.g., substitute variable) for a variable of interest to replace the corresponding observed variable; generating one or more phenotype-variable association models based on the test variable, the observed variables, and the observed phenotype; determining an observed model metric and a test model metric for the model; and determining the influence of the variable on the phenotype based on a comparison between the observed and test model metrics. In examples, the model metric can be: the variable weight (e.g., coefficients), the model's loss, the model's variance (e.g., coefficient of determination), and/or any other model metric.
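The comparison between observed and test model metrics can be illustrated with a minimal sketch. It assumes a simplified one-variable phenotype-variable association model and uses the coefficient of determination as the model metric; the function names `r_squared` and `association_metric` and the toy values are hypothetical and not taken from this description.

```python
from statistics import mean

def r_squared(variable, phenotype):
    """Coefficient of determination of a one-variable linear fit (squared correlation)."""
    v_mean, p_mean = mean(variable), mean(phenotype)
    cov = sum((v - v_mean) * (p - p_mean) for v, p in zip(variable, phenotype))
    v_var = sum((v - v_mean) ** 2 for v in variable)
    p_var = sum((p - p_mean) ** 2 for p in phenotype)
    return cov * cov / (v_var * p_var)

def association_metric(observed, test, phenotype):
    """Drop in model metric when the observed variable is replaced by its test variable."""
    return r_squared(observed, phenotype) - r_squared(test, phenotype)

# Toy population of six organisms (assumed values, same organism ordering throughout).
observed_v = [0, 1, 2, 0, 1, 2]               # observed allele codings
test_v = [1, 0, 1, 1, 0, 1]                   # stand-in values, uncorrelated with the phenotype
phenotype = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]    # phenotype driven entirely by observed_v

metric = association_metric(observed_v, test_v, phenotype)
```

A larger metric indicates that the model loses more explanatory power when the variable's information is removed. A multi-variable model would follow the same pattern with the test variable swapped into the full set of observed variables.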

The test variable can optionally have the same statistical distribution as the respective observed variable, but not be generated using information from said observed values. In an example, the test variable can be determined by: determining (e.g., fitting, training, etc.) a variable-variable association model using observed variables, wherein the resultant model (e.g., wherein the variable of interest is treated as the dependent variable and the other observed variables are treated as independent variables) can then be used to calculate the test variable. For example, the variable-variable association model can be used to calculate test values for each organism based on the observed values for other variables for the organism.
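As a hedged illustration of this step, the sketch below fits an ordinary least-squares variable-variable association model (variable of interest as the dependent variable, two neighboring variables as independent variables) and then calculates a test value for each organism from the neighbors alone. The helper names, the choice of ordinary least squares, and the toy data are assumptions made for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(columns, target):
    """Least-squares fit of target on the given columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(target))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * t for row, t in zip(design, target)) for j in range(p)]
    return solve_linear_system(xtx, xty)  # [intercept, coefficient per column]

def make_test_variable(columns, coefs):
    """Calculate one test value per organism from the other variables' observed values."""
    n = len(columns[0])
    return [coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
            for i in range(n)]

# Toy observed variables for six organisms (assumed allele codings).
neighbor_a = [0, 1, 2, 1, 0, 2]
neighbor_b = [1, 0, 2, 2, 0, 1]
variable_of_interest = [0, 1, 2, 2, 0, 1]

coefs = fit_ols([neighbor_a, neighbor_b], variable_of_interest)
test_variable = make_test_variable([neighbor_a, neighbor_b], coefs)
```

The resulting test values depend on each organism only through its neighboring variables, which is the property the stand-in needs in this sketch.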

In a specific example, the influence of each variable can be determined by: determining test variables for each variable; and calculating a single regression between the observed phenotype (e.g., the dependent variable) and a combined matrix (e.g., the independent variables) that includes both the test variables and their respective observed counterpart variables (e.g., the original genotypes). Highly-influential variables can be identified based on the difference between the coefficient of the respective test variable and the coefficient of the respective observed variable (e.g., where the variables with the highest coefficient difference can be treated as the most influential).
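The single-regression variant can be sketched as follows. This is a minimal illustration that assumes ordinary least squares and a combined matrix with full column rank; when test variables are exact linear combinations of observed variables, a regularized regression (e.g., ridge regression) would be needed instead. All names and values below are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_coefficients(columns, target):
    """Least-squares coefficients (intercept dropped) for target regressed on columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(target))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * t for row, t in zip(design, target)) for j in range(p)]
    return solve_linear_system(xtx, xty)[1:]

# Toy data for eight organisms (assumed values, same organism ordering throughout).
observed = {"V1": [0, 1, 2, 0, 1, 2, 0, 1], "V2": [1, 1, 0, 2, 0, 1, 2, 0]}
test = {"V1": [0.5, 1, 1.5, 0.5, 1.5, 1, 1, 0.5],
        "V2": [1, 0.5, 0.5, 1.5, 0.5, 1, 1.5, 1]}
phenotype = [3.0 * v for v in observed["V1"]]  # phenotype driven only by V1

names = list(observed)
# Combined matrix: observed columns first, then their test counterparts.
columns = [observed[name] for name in names] + [test[name] for name in names]
coefs = fit_coefficients(columns, phenotype)

# Association metric: gap between observed and test coefficients per variable.
association = {name: abs(coefs[i] - coefs[i + len(names)])
               for i, name in enumerate(names)}
ranked = sorted(names, key=association.get, reverse=True)
```

Because one regression yields every coefficient pair at once, the per-variable metrics come from a single fit rather than one fit per variable.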

However, the method can be otherwise performed.

3. Technical Advantages

Variants of the technology can confer one or more advantages over conventional technologies.

First, it is oftentimes difficult to determine which specific variables (e.g., genomic components such as genes, loci, genomic regions, etc.; environmental parameters; gene expression; etc.) cause and/or are associated with a given phenotype. The conventional approach of individually editing and testing genes is intractable due to the inherent size of the genome. This problem is compounded because phenotypes can be polygenic, so both individual gene-phenotype effects and gene combination-phenotype effects need to be tested; the number of testing permutations required is immense, and would require decades of data gathering. The inventors have discovered that causal variables can be identified in a highly efficient manner by modeling the relationship between a trait (e.g., phenotype) and a set of variables (e.g., genomic components), removing the information unique to a genomic component of interest from the model, and comparing the model's performance with and without the genomic component's information: the more the model's performance degrades when the genomic component information is removed, the more causal the genomic component.

Second, it is incredibly difficult to ensure the stand-in information (e.g., test variable) used to stand in for the original information (e.g., observed variable) has the same statistical distribution as the original variable's information without using the original variable information. To solve this, the inventors have further discovered that values for other variables (e.g., neighboring variables selected using a variable window) can be used to generate acceptable stand-in information (e.g., leveraging the fact that neighboring genes are oftentimes correlated and/or share an evolutionary history), thereby mitigating or eliminating original information leakage into the test variable, and increasing predictive power.

Third, the large dimensionality of the search space presents a significant computational load problem. In a first example, the inventors have discovered that by computing a regression with both original and stand-in information (e.g., where the coefficients for the observed variables and test variables can then be compared for each variable), an association metric for each variable can be computed faster and with a lower computational load. In a second example, variants of the technology can reduce the dimensionality by pre-selecting a variable subset (e.g., using clustering techniques, with an initial regression to determine a subset of genomic components that are associated with variables that have nonzero coefficients, etc.). In a third example, the inventors have discovered that by adaptively selecting a variable window (e.g., identifying a smaller number of variables correlated with variables of interest), a lower dimensional model (e.g., low-dimension linear regression) can be used to determine the test variable. In a fourth example, the inventors have discovered that models (e.g., generative models, autoencoders, etc.) can output multiple test variables, which can significantly increase computational speed. In a fifth example, an encoder can be trained on other variables (e.g., trained on genomic component values elsewhere in a genome, far from the variables of interest), then iteratively implemented (e.g., without re-training) to generate each new test variable. In a sixth example, the inventors have discovered that test variables can be generated in a reduced dimension space (e.g., a latent space), wherein the causal variables can be identified in the reduced dimension space.

Fourth, in variants, simplifying assumptions can be made such that identifying variables associated with a phenotype is a tractable problem. In a first example, there are a significant number of genomic components (e.g., loci) for a single model, so a univariate model for each genomic component can be implemented. In a second example, population structure is unknown and difficult to quantify, so principal component analysis and/or kinship analysis can be used to correct for structure. In a third example, the effect of certain genomic components can depend on the environment, so a separate analysis can be performed in separate environments. In a fourth example, many genomic components interact to generate a phenotype, so the assumption can be made that epistasis is small enough to be ignored. However, other assumptions can be implemented.

Fifth, variants of the technology can reduce the dimensionality of the search space by identifying causal variables (e.g., conditional model reliance features). For example, a phenotype model (e.g., updated phenotype-variable association model) can be generated using only the causal variables. Since the causal variables represent only a subset of all potential variables, the (updated) phenotype-variable association model can: include inter-variable interactions, be a low-dimensional regression, be used to determine target causal variable values using more computationally efficient and accurate optimization techniques (e.g., convex optimization with a single solution), and/or provide other advantages.

Sixth, variants of the technology can identify a set of target causal variable values (e.g., a target set of alleles, a target set of environmental parameters, etc.) before attempting to predictively breed organisms. This can be more computationally efficient (e.g., by limiting the number of variables that are used in a predictive breeding algorithm) and result in better-optimized organisms and growing environments (e.g., by avoiding local minima during optimization). For example, a hypothetical target organism with a target set of randomly-generated causal variable values resulting in the best phenotype (e.g., the most performant trait) for a given growing environment can be identified before attempting to generate the optimal variable values biologically through breeding. In another example, the hypothetical target organism, the environmental variable values, and/or treatment values (e.g., DNA methylation treatments) resulting in the most performant trait can be identified before attempting to predictively breed the existing organisms.

However, further advantages can be provided by the system and method disclosed herein.

4. Method

As shown in FIG. 1, the method can include: determining observed values for variables and phenotypes for each organism in a population S100, removing information from variables of interest S300, determining a phenotype-variable association model S500, identifying causal variables associated with a phenotype S600, and/or any other suitable steps. The method can optionally include selecting a subset of variables S200, determining breeding parameters to achieve a target causal variable value set S700, and/or any other suitable steps.

All or portions of the method can be performed once (e.g., for a phenotype, for a species, for a variable, etc.), multiple times, iteratively (e.g., for each phenotype in a set, for each variable in a set, etc.), in real time (e.g., responsive to a request), concurrently, asynchronously, periodically, and/or at any other suitable time. All or portions of the method can be performed automatically, manually, semi-automatically, and/or otherwise performed.

All or portions of the method can be performed using a computing system, using a database (e.g., a system database, a third-party database, etc.), using a genomic sequencer, using assay tools, using measurement systems, by a user, and/or by any other suitable system. The computing system can include one or more: CPUs, GPUs, custom FPGAs/ASICs, microprocessors, servers, cloud computing, and/or any other suitable components. The computing system can be local, remote, distributed, or otherwise arranged relative to any other system or module.

The method can be used with one or more models, including variable-variable association models, phenotype-variable association models, analysis models, variable window models, process models, and/or any other model. The models can include or use: regression (e.g., linear, nonlinear, multivariate, leverage regression, etc.), classification, neural networks (e.g., CNN, DNN, CAN, LSTM, RNN, etc.), rules, heuristics, equations (e.g., weighted equations, etc.), selection (e.g., from a library), instance-based methods (e.g., nearest neighbor), regularization methods (e.g., ridge regression), decision trees (e.g., random forest), Bayesian methods (e.g., Naïve Bayes, Markov, hidden Markov models, etc.), kernel methods, deterministics, genetic programs, encoders (e.g., autoencoders), support vectors, ensemble methods, association rules, optimization methods (e.g., Bayesian optimization, convex optimization, non-convex optimization, multi-objective optimization, etc.), statistical methods (e.g., probability), comparison methods (e.g., matching, distance metrics, thresholds, etc.), dimensionality reduction (e.g., principal component analysis, t-distributed stochastic neighbor embedding, linear discriminant analysis, partial least squares regression, Sammon mapping, multidimensional scaling, projection pursuit, etc.), clustering methods (e.g., k-means clustering, hierarchical clustering, expectation maximization, etc.), generative models, process models, biological models, and/or any other suitable method. Models can use classical or traditional models, machine learning models, and/or be otherwise configured. Models can be low-dimension models, high-dimension models, and/or otherwise configured.

Models can be trained, learned, fit, predetermined, and/or can be otherwise determined. Models can be trained using self-supervised learning, semi-supervised learning, supervised learning, unsupervised learning, transfer learning, reinforcement learning, and/or any other suitable training method.

The method can be used with variables. Variables are preferably characteristics associated with an organism, but can be otherwise defined. Examples of variables include: genomic components (e.g., genomic variables), gene expression (e.g., which gene and/or variant thereof are expressed, transcribed, etc.; DNA and/or RNA expression) (e.g., gene variables), protein expression (e.g., which proteins are expressed) (e.g., protein variables), methylation (e.g., which DNA positions are methylated, overall amount of methylation, etc.), environmental variables (e.g., environmental parameters, such as temperature, light, heat, soil quality, nutrient composition, water availability, land grade, treatment application frequency, etc.), transcriptome variables (e.g., RNA locus, RNA transcript identifier, RNA region, a gene corresponding to an RNA transcript, etc.), protein binding variables, microbial variables (e.g., for microbes associated with the organism), and/or any other characteristic associated with an organism. Genomic components are preferably basic units shared across all organisms of a population (e.g., a species), but can alternatively be otherwise defined. Examples of genomic components include: a gene, a gene group, a locus (e.g., DNA or RNA), a gene region, RNA region, RNA transcript identifier, k-mer, and/or any other genomic component. Examples of environmental variables include: temperature; pressure; light; humidity; concentration and/or distribution of macronutrients and/or micronutrients (e.g., nitrogen, phosphorus, etc.); growing duration, treatment frequency, and/or any other temporal characteristic thereof; and/or any other characteristic of an organism's environment. The set of variables (e.g., a plurality of variables) preferably includes all variables, but can alternatively include a subset of the variables (e.g., the variables that can be controlled, etc.), and/or be otherwise defined. For example, the variable set can include: all possible loci, loci of interest, all possible genes (e.g., all genes of one or more organisms in the population), expressible protein, environmental variables, genes of interest, all genomic regions (e.g., nonoverlapping or overlapping), genomic regions of interest, all methylated locations, methylated locations of interest, DNA and/or RNA sequences, all environmental parameters, environmental parameters of interest, and/or any other variables.

Variable values are preferably a measure of the organism's value for the given variable, but can be otherwise defined. Variable values can be qualitative, quantitative, relative, discrete, continuous, a classification, numeric, binary, and/or be otherwise characterized. In a first example, genomic component values can include: genotypes, DNA and/or RNA sequences, single nucleotide polymorphisms (SNPs), k-mers, k-mer counts, RNA counts, allele locations, presence/absence of a genomic component (e.g., of a particular gene sequence), evolutionary history, heredity history, DNA fragmentation, and/or any other genetic and/or cellular information. In a specific example, a genomic component value can be a numerical value representing the genotype (e.g., an allele coding) for an organism at a gene locus associated with the variable. In examples, an allele coding can include a 0, 1, or 2 value (e.g., determined based on allele frequency in the population) and/or any other values (e.g., 0-9 values when more than two copies of an allele are present in the population). In a second specific example, a genomic component value can be a numerical value representing the k-mer count for an organism for a k-mer associated with the variable. Genomic component values can optionally include and/or correspond to a set of genetic information (e.g., a set of genes, a set of SNPs, a set of k-mers, a raw DNA sequence, etc.). In a second example, gene expression values can include: RNA concentration; whether or not a given gene has been expressed, and/or other measures of gene expression. In a third example, protein expression values can include: whether a given protein is expressed, concentration of each protein, and/or other measures of protein expression. In a fourth example, methylation values can include: a ratio between the number of times a gene is methylated and the number of times a gene is sequenced (e.g., methylation fraction), and/or other measures of methylation. In a fifth example, environment values can include: temperature values, pressure values, nutrient concentration in the growing medium, moisture level in the growing medium, humidity level, ultraviolet light level, and/or other measures of environmental or growing variables. In a sixth example, transcriptome variable values can include: RNA sequences, RNA expression (e.g., RNA transcription for a given gene or allele), quantity of RNA transcript, transcription amount of a given RNA sequence or gene, and/or other transcriptome values. In a seventh example, protein binding variable values can include a measure of protein binding affinity, and/or any other protein binding values. In an eighth example, microbial variable values can include: species abundance counts for a microbial community on/near the organism, and/or any other microbial information values. However, variables and/or variable values can be otherwise defined.
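Two of the variable values above can be made concrete with a short sketch. It assumes diploid organisms and a biallelic locus, and codes each genotype as the number of copies of the allele that is less frequent in the population, which is one possible reading of "determined based on allele frequency". The function names and sample data are hypothetical.

```python
from collections import Counter

def allele_codings(genotypes):
    """Code each diploid genotype as 0, 1, or 2 copies of the population's minor allele."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    if len(counts) < 2:
        # Monomorphic locus: no minor allele is present in the population.
        return [0] * len(genotypes)
    # Minor allele = least frequent; ties broken alphabetically for determinism.
    minor = min(counts, key=lambda a: (counts[a], a))
    return [sum(1 for allele in pair if allele == minor) for pair in genotypes]

def methylation_fraction(methylated_reads, total_reads):
    """Ratio of times a gene is observed methylated to times it is sequenced."""
    return methylated_reads / total_reads

# Toy genotypes at one locus for four organisms (assumed values).
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A")]
codes = allele_codings(genotypes)          # one numeric value per organism
fraction = methylation_fraction(3, 12)     # 3 methylated reads out of 12
```

Here allele "G" occurs three times against five for "A", so "G" is treated as the minor allele and the four organisms carry 0, 1, 2, and 0 copies respectively.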

The method can be used with phenotypes (e.g., traits). The phenotype is preferably an observable characteristic or trait of the organisms, but can be otherwise defined. Phenotype values can be qualitative, quantitative, relative, discrete, continuous, a classification, numeric, binary, and/or be otherwise characterized. Examples of phenotype values can include: drought resistance metric, salt resistance metric, heat resistance metric, contaminant resistance metric, a macronutrient parameter and/or micronutrient parameter (e.g., density, composition, etc.), mass, height, appearance (e.g., color), compound processing (e.g., amount of nitrogen fixation, amount of heavy metal fixation, etc.), and/or any other trait values. In variants, the phenotype can be treated as a variable. However, phenotypes and/or phenotype values can be otherwise defined.

Each variable and/or phenotype can be a vector including the values (e.g., observed values and/or test values) for the respective variable or phenotype for each of a set of organisms (e.g., ordered set of organisms). The values can be observed values (e.g., from experiments or measurements); predicted, simulated, or otherwise generated values (e.g., predicted from genetic mutation simulations, predicted using cross-breeding simulations, test values determined via S450, etc.); and/or otherwise determined. In a first example, a genomic component variable can be a vector of genomic component values (e.g., representing genotypes corresponding to the genomic component, representing k-mers corresponding to the genomic component, etc.) with one genomic component value in the vector for each organism in the population. In a second example, a phenotype variable can be a vector of phenotype values (e.g., representing the presence and/or absence of one or more traits, representing a collection of traits, etc.) with one phenotype value for each organism in the population. However, the phenotype and/or variables can be otherwise represented.

Determining observed values for variables and phenotypes for each organism in a population S100 functions to determine information (e.g., observed values) used for predicting association between variables and/or between phenotypes and variables. S100 can be performed before S200, before S300, and/or at any other time. S100 can optionally be performed one or more times: for each organism in a population of organisms, for each phenotype in a phenotype set, for each variable in a variable set, and/or at any other time.

The organisms in the population (e.g., set of organisms) are preferably of the same species, but can alternatively be of different species. The organisms can be any plant, animal, fungus, protist, moneran, and/or any other organism. In illustrative examples, the organisms can be algae, broccoli, radishes, strawberry, dandelions, corn, bamboo, potatoes, mushrooms, herbs, pigs, cows, chickens, and/or any other organisms. In specific examples, the organisms can be used as food products, used to manufacture food products (e.g., as an ingredient in a food product), used to manufacture materials (e.g., rubber, oil, etc.), and/or used for any other purposes.

S100 can include: for each organism, determining observed values for each variable in a set of variables; for each organism, determining observed values for each phenotype in a set of phenotypes; determining an observed variable for each of the set of variables based on the observed variable values; and determining an observed phenotype for each of the set of phenotypes based on the observed phenotype values. The set of variables preferably includes multiple variables, but can alternatively include a single variable. The set of phenotypes is preferably a single phenotype (e.g., representing a single trait, representing an aggregate of traits, etc.), but can alternatively include multiple phenotypes.

The observed values for variables for each organism and/or the observed values for phenotypes for each organism can be determined by: retrieving values from a database, genotyping, observing (e.g., measuring, sequencing, analyzing measurements, etc.), analyzing sequences, simulating/predicting (e.g., using a model, using cross-breeding and/or mutation simulations, etc.), aggregating values (e.g., aggregating multiple observed values for an organism to determine an aggregate observed value), transforming values (e.g., converting qualitative values to quantitative values), a combination thereof (e.g., using different methods for different variables and/or phenotypes), and/or any other method of determining organism information. Sequencing can include DNA sequencing, RNA sequencing, k-mer counting, and/or any other genetic component measurement. Determining observed values for phenotypes and/or variables can optionally be performed after the physical organism is grown and/or harvested, during growth and/or harvesting, and/or at any other stage. In specific examples, determining observed values for a physical organism can include: phenotyping the organism and optionally converting the organism's phenotype into a numerical value (e.g., using a scoring method, ranking method, rating method, mapping, comparison methods, etc.) to determine the respective observed phenotype value; sequencing genomic components to determine the respective observed variable value(s); measuring/recording environmental conditions (e.g., growth conditions) for the organism to determine the respective observed variable value; retrieving predetermined environmental conditions to determine the respective observed variable value; measuring gene expression to determine the respective observed variable value; measuring DNA methylation to determine the respective observed variable value; using a known relationship (e.g., regression, neural network, any other model, lookup table, etc.) between a first variable (e.g., environmental variable) and a second variable (e.g., expression of a gene) to predict the observed values for the first variable; and/or any other method of determining values for variables and/or phenotypes.

Determining an observed variable for each variable in the set of variables can include aggregating observed variable values across the organisms to form a vector (e.g., a numerical vector). The observed variable is preferably a vector of observed variable values for a corresponding variable (e.g., where all variable values in the observed variable vector are associated with the same variable), but can be otherwise constructed. However, the method can include any other observed variable to variable cardinality (e.g., where each variable is associated with more than one observed variable vector and/or vice versa). In a specific example, each element of the observed variable vector includes an observed value for a different organism (e.g., each observed variable includes an observed value for the corresponding variable from each organism in the population). In an illustrative example, the observed variable for a particular genomic component includes an allele coding value from each organism for that genomic component. In variants, observed variables can function as independent variables in all or parts of the method.

Observed values within each observed variable are preferably ordered by organism (e.g., such that vector elements correspond to the same organism order across observed variables), but can alternatively be otherwise arranged. Multiple observed variables (e.g., observed variables for the genomic components of the species' genome) can optionally be aggregated into an observed data set (e.g., a matrix, a vector of variables, a set, a design matrix, etc.). An example is shown in FIG. 11A. In a first example, the observed variables are organized within the data set (e.g., matrix column order, vector order, set order, etc.) based on a locus associated with the respective genomic component of each variable. In a second example, the observed variables are unordered within the data set. In a third example, the observed variables are organized within the data set based on association metrics for each variable. However, the observed variables can be otherwise organized.

Determining an observed phenotype can include aggregating observed phenotype values across the organisms to form a vector (e.g., a numerical vector). The observed phenotype is preferably a vector of observed phenotype values for a corresponding phenotype (e.g., where all phenotype values in the observed phenotype vector are associated with the same phenotype), but can be otherwise constructed. However, the method can include any other observed phenotype to phenotype cardinality (e.g., where each phenotype is associated with more than one observed phenotype vector and/or vice versa). In a specific example, each element of the observed phenotype vector includes an observed value for a different organism (e.g., each observed phenotype includes an observed value for the corresponding phenotype from each organism in the population). In an illustrative example, the observed phenotype for a particular phenotype includes a trait value from each organism for that phenotype. In variants, observed phenotypes can function as dependent variables in all or parts of the method. Observed values within a phenotype are preferably ordered by organism (e.g., such that vector elements correspond to the same organism order across phenotypes and variables), but can alternatively be otherwise arranged.
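The aggregation described in the preceding paragraphs can be sketched as below: per-organism records are turned into observed variable vectors, an observed phenotype vector, and an observed data set matrix, all sharing one organism ordering. The record layout, the function name `build_observed_data`, and the sample values are assumptions for illustration.

```python
def build_observed_data(records, variable_names, phenotype_name):
    """Aggregate per-organism values into vectors that share one organism ordering."""
    organisms = sorted(records)  # fixed ordering reused for every vector
    observed_variables = {
        name: [records[org][name] for org in organisms] for name in variable_names
    }
    observed_phenotype = [records[org][phenotype_name] for org in organisms]
    # Observed data set: one row per organism, one column per variable,
    # with columns kept in the order given by variable_names (e.g., locus order).
    matrix = [
        [observed_variables[name][i] for name in variable_names]
        for i in range(len(organisms))
    ]
    return organisms, observed_variables, observed_phenotype, matrix

# Hypothetical per-organism records: allele codings at two loci plus a height phenotype.
records = {
    "org_b": {"locus_1": 2, "locus_2": 0, "height_cm": 14.5},
    "org_a": {"locus_1": 0, "locus_2": 1, "height_cm": 11.0},
    "org_c": {"locus_1": 1, "locus_2": 1, "height_cm": 12.5},
}
organisms, observed_variables, observed_phenotype, matrix = build_observed_data(
    records, ["locus_1", "locus_2"], "height_cm"
)
```

Element i of every observed variable and of the observed phenotype refers to the same organism, so the matrix rows line up with the phenotype vector for later regression.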

However, observed values for variables and/or phenotypes can be otherwise determined.

The method can optionally include selecting a subset of variables S200, which functions to reduce the number of variables to analyze. In variants, selecting a subset of variables (e.g., subsampling the variables) can decrease computational load, increase analysis speed, transform high-dimensional data to low-dimensional data, enable low-dimension models (e.g., low-dimension regressions, low-dimension statistics, etc.), enable the use of convex optimization methods (e.g., strictly convex optimization methods, with a single solution), and/or provide other advantages. S200 can be performed after S100, before S300 (e.g., to determine which variables to use when removing information), before S400 (e.g., to determine the variables from which to select the variable window), and/or at any other time. The set of variables can optionally be adjusted to include only the variables corresponding to the subset of variables. In all or parts of the method, the set of variables can refer to the subset of variables.

In a first variant, the variable subset is selected from the variable set by manually specifying the variables (e.g., loci).

In a second variant, the variable subset is automatically selected. In examples, the variable subset can be selected based on: the evolutionary history of the population, linkage disequilibrium analysis, previous iterations of the method, information from S100, principal component analysis, kinship analysis, variable analysis parameters, coinheritance (e.g., a genome segment that is coinherited), and/or otherwise selected.

In a third variant, the variable subset is selected using a model. In a first embodiment, the model is a phenotype-variable association model (e.g., a regression) including observed variables for each of the set of variables, wherein the variable subset is selected based on association metrics (e.g., regression coefficients, any association metric in S600, etc.) for each variable. In a first specific example, the variables corresponding to observed variables that have nonzero coefficients are selected. In a second specific example, the variables corresponding to observed variables that have a coefficient and/or absolute value of a coefficient above a threshold are selected. In a second embodiment, the model is used to cluster observed variables, wherein each cluster is a variable subset (e.g., the variables associated with observed variables in a cluster form a variable subset); an example is shown in FIG. 3. In examples, observed variables (e.g., k-mers) can be clustered based on variable analysis parameters. For example, linked or otherwise correlated observed variables can optionally be clustered together. In a specific example, k-mer variables can be clustered using k-means clustering, wherein the k is specified such that no cluster exceeds a threshold variable subset size. However, any other clustering method can be used.
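The two coefficient-based selection examples can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `select_by_coefficient`, the locus names, and the coefficient values are hypothetical, and the coefficients are assumed to come from an already-fitted phenotype-variable regression.

```python
def select_by_coefficient(coefficients, threshold=0.0):
    """Return the variables whose absolute coefficient exceeds `threshold`.

    `coefficients` maps a variable name to its fitted regression coefficient.
    With threshold=0.0 this keeps every variable with a nonzero coefficient.
    """
    return [name for name, coef in coefficients.items() if abs(coef) > threshold]


# Hypothetical coefficients, e.g., from a sparse regression fit.
coefs = {"locus_1": 0.0, "locus_2": 0.42, "locus_3": -0.07, "locus_4": 0.0}

nonzero_subset = select_by_coefficient(coefs)        # nonzero coefficients
strong_subset = select_by_coefficient(coefs, 0.1)    # |coefficient| above 0.1
```

Using the absolute value treats strongly negative and strongly positive coefficients alike; a signed threshold would be a different design choice.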

In a fourth variant, the variable subset is determined using variable window selection methods (e.g., described in S400). For example, the variable subset can be selected based on the variable(s) of interest (e.g., wherein the subset includes variables flanking the variable(s) of interest, etc.). An example is shown in FIG. 2.

However, the variable subset can be selected: randomly, by shuffling the variable subset (e.g., iteratively removing less-active or less-influential variables and selecting new variables), by resampling until a test variable quality is met (e.g., the fit quality, the R² of the model fit, the RMSE, the log likelihood difference, the similarity between the test variable distribution and the variable of interest's distribution, etc.), and/or otherwise selected.

However, any variable selection method can be used.

The variable analysis parameters can include: autocorrelation analysis (e.g., patterns), linkage disequilibrium analysis, evolutionary history of the population, principal component analysis, kinship analysis, variable location, correlation strength, effective population size, summary statistics (e.g., distribution, parameters of a variable-variable association model or phenotype-variable association model using the variables, etc.), and/or any other variable analyses. In a first specific example, an analysis model can be used to determine (e.g., extract) the analysis parameters based on observed variables (e.g., an observed variable dataset). In a second specific example, the variable analysis parameters can be retrieved from a dataset.

The variable subset can be symmetric relative to variable(s) of interest or non-symmetric. The variable subset can optionally be selected such that the number of variables is less than a threshold number. In an example, when the number of organisms in the population is N, the threshold number can be between 0.1*N-10*N or any range or value therebetween (e.g., 0.5*N-1*N, N, N*⅔, etc.), but can alternatively be less than 0.1*N or greater than 10*N. In another example, the threshold number can be between 10 loci-5000 loci (e.g., 50 loci-250 loci), but can alternatively be less than 10 loci or greater than 5000 loci. In another example, the threshold number can be between 500 bases-10,000 kilobases (e.g., 1 kilobase-10 kilobases), but can alternatively be less than 500 bases or greater than 10,000 kilobases.

Selecting the variable subset can optionally include selecting a subset of variables corresponding to a first variable type (e.g., genomic component variables), and selecting all variables of a second variable type (e.g., environmental variables). An example is shown in FIG. 3.

However, the variable subset can be otherwise determined.

Removing information from variables of interest S300 can function to remove the variable's influence on the phenotype (in a phenotype-variable association model). S300 can be performed after S100, after S200, iteratively for each variable in a set (e.g., across the observed data set), and/or at any other time.

S300 can be performed for one or more variables of interest in the set of variables (e.g., concurrently and/or serially). S300 is preferably performed for a single variable of interest (e.g., each iteration of S300 is performed for a single variable of interest), but can alternatively be performed for multiple variables of interest (e.g., determining a single test variable corresponding to a set of variables of interest). The variables of interest can be: manually selected, automatically selected (e.g., wherein each iteration of S300 includes a subsequent variable of interest), randomly selected, and/or otherwise selected from the set of variables. The set of variables can be: the variable subset determined in S200 (e.g., where the observed data set includes only those variables in the subset of variables), all variables used in S100 (e.g., all genomic components in the genome), and/or any other set of variables.

Removing information from a variable of interest can include replacing and/or otherwise changing the observed variable associated with the variable of interest. In a first variant, removing information includes removing the observed variable from a phenotype-variable association model (e.g., where the model predicts phenotype values without using the observed variable). In a second variant, a test variable (with test values) can be determined as a replacement for the observed variable. For example, S300 can include determining a variable window (e.g., S400) and determining test values for the variable(s) of interest using the variable window (e.g., S450). However, information can be otherwise removed.

The method can optionally include determining a variable window S400, which can function to determine a subset of variables, wherein the corresponding observed variables can be used to predict a test variable for the variable(s) of interest (e.g., wherein the variable corresponding to the test variable is not within the variable window). In a specific example, the variable(s) of interest and the variable window can both be within the variable subset determined in S200 (e.g., within a shared k-mer cluster).

The variable window size and/or other parameters can be fixed or variable (e.g., based on the variable of interest). The variable window size is preferably less than a threshold number of variables. In an example, when the number of organisms in the population is N, the threshold number can be between 0.1*N-10*N or any range or value therebetween (e.g., 0.5*N-1*N, N, N*⅔, etc.), but can alternatively be less than 0.1*N or greater than 10*N. In another example, the threshold number can be between 10 loci-5000 loci (e.g., 50 loci-250 loci), but can alternatively be less than 10 loci or greater than 5000 loci. In another example, the threshold number can be between 500 bases-10,000 kilobases (e.g., 1 kilobase-10 kilobases), but can alternatively be less than 500 bases or greater than 10,000 kilobases. The variable window is preferably positioned relative to the variable of interest (e.g., centered about the variable of interest, offset from the variable of interest, starting or ending at the variable of interest, within a threshold distance from the variable of interest, etc.), but can be otherwise positioned. The variable window can be symmetric about the variable of interest (e.g., including optional truncation or wrapping when the variable is at an end of a variable set) or non-symmetric.

The variable window can be manually determined, determined using a model (e.g., a variable window model, a variable-variable association model, etc.), predetermined, randomly determined, and/or otherwise determined.

In a first variant, the variable window is fixed relative to the variable of interest. In a first example, the variable window can be a fixed size, and positioned symmetric about the variable of interest (e.g., wherein the variable window includes two flanks on either side of the variable of interest in the variable set). In a second example, the variable window includes all variables in the subset of variables (e.g., S200).
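A fixed, symmetric window with the optional truncation or wrapping at the ends of the variable set could be computed as in the sketch below. It assumes variables are addressed by integer position in an ordered variable set; the function name `fixed_window` and its parameters are illustrative only.

```python
def fixed_window(n_variables, index, flank, wrap=False):
    """Positions of up to `flank` variables on each side of `index`.

    The variable of interest itself is excluded from the window.
    wrap=False truncates the window at the ends of the variable set;
    wrap=True continues around the other end (circular ordering).
    """
    offsets = [o for o in range(-flank, flank + 1) if o != 0]
    if not wrap:
        return [index + o for o in offsets if 0 <= index + o < n_variables]
    window = []
    for o in offsets:
        j = (index + o) % n_variables
        # Skip the variable of interest and any position already included.
        if j != index and j not in window:
            window.append(j)
    return window


middle = fixed_window(10, 5, 2)                # [3, 4, 6, 7]
truncated = fixed_window(10, 0, 2)             # [1, 2]
wrapped = fixed_window(10, 0, 2, wrap=True)    # [8, 9, 1, 2]
```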

In a second variant, the variable window can be determined based on variable analyses. For example, the variable window can vary dynamically based on the variable analysis parameters associated with the variable of interest. In examples, the variable window can be determined based on: linkage disequilibrium (e.g., wherein the variable window includes or excludes other variables in linkage disequilibrium with the variable of interest); variable of interest location (e.g., known location, predicted location, etc.); local autocorrelation patterns; correlation strength (e.g., wherein the variable window includes other variables tightly correlated with the variable of interest); and/or otherwise determined. An example is shown in FIG. 4.
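As a rough illustration of a correlation-based window, the sketch below keeps (or, with `exclude=True`, drops) variables whose squared Pearson correlation with the variable of interest meets a cutoff. Squared correlation is used here only as a simple stand-in for a linkage disequilibrium measure; the cutoff, the function names, and the genotype values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlated_window(observed, target, min_r2=0.5, exclude=False):
    """Select window variables by squared correlation with the target variable.

    `observed` maps a variable name to its per-organism values (same organism order).
    exclude=False keeps variables with r^2 >= min_r2; exclude=True keeps the rest.
    """
    window = []
    for name, values in observed.items():
        if name == target:
            continue
        r2 = pearson(observed[target], values) ** 2
        if (r2 >= min_r2) != exclude:
            window.append(name)
    return window


# Hypothetical genotype dosages (0/1/2) for six organisms.
observed = {
    "locus_1": [0, 1, 2, 2, 0, 2],
    "locus_2": [0, 1, 2, 1, 0, 2],
    "locus_3": [2, 0, 1, 1, 0, 2],
}
linked = correlated_window(observed, "locus_2")                  # ["locus_1"]
unlinked = correlated_window(observed, "locus_2", exclude=True)  # ["locus_3"]
```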

In a third variant, the variable window can be adaptively determined. For example, the variable window can be iteratively re-determined (e.g., using a variable window model) until one or more criteria are satisfied. The criteria can include a variable window evaluation metric criterion (e.g., the variable window evaluation metric rising above a threshold), a number of iterations, a number of iterations without an increase in the model metric, completing a cycle through all variables in the variable set (e.g., in the variable subset), a threshold criterion, and/or any other criterion. The variable window evaluation metric is preferably a model metric for a variable-variable association model (e.g., any model metric described in S600), wherein the variable-variable association model is used to determine a test variable for the variable of interest based on observed variables for variables in the (current iteration) variable window. For example, the variable-variable association model can be re-determined (e.g., re-trained) in each iteration. Alternatively, the variable window evaluation metric can be any other assessment of test variable determination.

In a first example, in each iteration, the variables in the variable window are randomly selected from the variable set (e.g., the variable subset). An example is shown in FIG. 5A. In a second example, the variable window is segmented into high-importance variables (e.g., an active set) and low-importance variables (e.g., a shuffled set) based on an association metric for each variable in the variable window (e.g., association metrics as described in S600, for the variable-variable association model), wherein the association metrics are determined based on the variable-variable association model. In each iteration, the low-importance variables are then replaced with new variables, and the variable window is re-segmented for the next iteration (e.g., wherein a new variable can replace a high-importance variable if the association metric for the new variable is above the association metric for the high-importance variable). An example is shown in FIG. 5B.
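One iteration of the active-set/shuffled-set scheme might look like the sketch below. It assumes the association metrics for the current window have already been computed from a fitted variable-variable association model; the function name `reshuffle_window`, the `keep_fraction` split, and the example metric values are hypothetical choices, not taken from the description.

```python
import random


def reshuffle_window(window, association, pool, keep_fraction=0.5, seed=None):
    """Run one iteration of window shuffling.

    window:      variables currently in the window
    association: maps each window variable to its association metric
    pool:        all variables that may enter the window
    The highest-|metric| variables form the active set and are kept; the
    remaining (shuffled) set is replaced by variables drawn from outside
    the current window.
    """
    rng = random.Random(seed)
    ranked = sorted(window, key=lambda v: abs(association[v]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    active = ranked[:n_keep]
    candidates = [v for v in pool if v not in window]
    n_new = min(len(window) - n_keep, len(candidates))
    return active + rng.sample(candidates, n_new)


window = [0, 1, 2, 3]
association = {0: 0.9, 1: -0.1, 2: 0.5, 3: 0.05}   # hypothetical metric values
new_window = reshuffle_window(window, association, pool=list(range(10)), seed=1)
```

After the model is refit on `new_window`, calling the function again re-segments the window, so a newly added variable with a higher metric can displace a previously active one.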

However, the variable window can be otherwise determined.

The method can optionally include determining a test variable for the variable(s) of interest S450, which can function to generate a variable to stand in for one or more corresponding observed variables (e.g., used as inputs in a phenotype-variable association model), to generate a negative control for one or more observed variables, to remove information from one or more observed variables (e.g., while maintaining a suitable variable form and/or distribution such that an original observed variable can be exchanged with its corresponding test variable), and/or to otherwise perturb one or more observed variables. The test variable preferably has the same or substantially the same distribution (e.g., statistical distribution) as the observed variable associated with the same variable of interest, but alternatively can have a different distribution. S450 can be performed after S400 and/or at any other time.

The test variable can be generated using a variable-variable association model, be randomly determined, be perturbed, be manually determined, and/or be otherwise determined. In a first variant, determining a test variable for a corresponding observed variable can include replacing the observed variable values with null values. In a second variant, determining a test variable can include randomly generating values to replace the observed variable values. In a third variant, determining a test variable can include adding noise to the observed variable values. In a fourth variant, determining a test variable can include determining a distribution of observed variable values based on the corresponding observed variable, and generating test variable values to match the distribution (e.g., a genotype distribution). For example, the distribution can be modeled (e.g., as a Gaussian distribution), wherein test variable values can be randomly selected from the modeled distribution. An example is shown in FIG. 10. In a fifth variant, the test variable can be determined using a process model (e.g., representing how the variable values are generated). For example, the process model can be a forward-in-time evolution model. Inputs to the process model can include variable analysis parameters, other genetic parameters, and/or any other information. Outputs from the process model can include test variables (e.g., including synthetic variable values). An example is shown in FIG. 9. In a sixth variant, determining a test variable can include determining (e.g., training) a variable-variable association model, and determining the test variable using the variable-variable association model.
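The distribution-matching variant can be illustrated with two simple generators. Both use the empirical distribution of the observed values rather than a fitted parametric (e.g., Gaussian) model, which is a simplifying assumption of this sketch; the function names and genotype values are hypothetical.

```python
import random


def resampled_test_variable(observed_values, seed=None):
    """Draw test values with replacement from the observed values.

    The result follows the empirical distribution of the observed variable
    in expectation, but is decoupled from the per-organism ordering.
    """
    rng = random.Random(seed)
    return [rng.choice(observed_values) for _ in observed_values]


def permuted_test_variable(observed_values, seed=None):
    """Shuffle the observed values across organisms.

    The marginal distribution is preserved exactly; only the assignment of
    values to organisms changes.
    """
    rng = random.Random(seed)
    return rng.sample(observed_values, len(observed_values))


genotypes = [0, 1, 2, 1, 0, 2, 1, 1]   # hypothetical dosages for eight organisms
resampled = resampled_test_variable(genotypes, seed=0)
permuted = permuted_test_variable(genotypes, seed=0)
```

Permutation keeps the allele counts identical to the observed variable, whereas resampling lets them fluctuate; which behavior is preferable depends on how strictly the test variable must match the observed distribution.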

Inputs to the variable-variable association model can include: variable values (e.g., observed variables including observed variable values), an optional randomization parameter (e.g., a parameter that can introduce randomness in the model pre- or post-training), and/or any other suitable inputs. For example, observed variable inputs can include (only) the observed variables corresponding to the variable window. Outputs from the variable-variable association model can include: variable values (e.g., a test variable including test variable values), and/or any other suitable outputs. For example, test variable outputs can include a single test variable (associated with one or more observed variables) or multiple test variables (e.g., each associated with one or more observed variables).

In a first embodiment, the variable-variable association model includes a regression fit to observed variables, where the observed variable for the variable of interest is treated as the dependent variable and observed variables in the variable window are treated as the independent variables. The resulting (fitted) regression is then used to determine the test variable, wherein the test variable is treated as the dependent variable. An example is shown in FIG. 6.

In an illustrative example, the variable-variable association model is a regression of the form: V₂ ~ a₁V₁ + a₃V₃, where V₂ is the observed variable for variable 2 (e.g., locus 2), V₁ is the observed variable for variable 1, V₃ is the observed variable for variable 3 (e.g., where variables 1 and 3 are selected based on the variable window), and a₁ and a₃ are determined coefficients. The determined coefficients can be used to calculate the test variable for variable 2: T₂ = a₁V₁ + a₃V₃. For example, test variable T₂ can include a T₂ value for each organism, calculated using the regression and the organism's observed values for V₁ and V₃.

In a second embodiment, the variable-variable association model includes a machine learning model (e.g., an autoencoder, CNN, etc.) trained to predict test variable values based on the values for other observed variables. For example, the variable-variable association model can be trained on a first subset of variables (e.g., a first area of the genome), and then applied to a second subset of variables to determine test variables for variables of interest (e.g., outputting multiple test variables for multiple variables of interest). The first subset of variables can exclude the second subset of variables, include a portion of the second subset, be separated from the second subset by a threshold distance (e.g., genomic distance), or be otherwise related to the second subset of variables. In an example, the variable-variable association model can determine an encoding for the variables of interest (e.g., a single encoding for multiple variables of interest) based on observed variables in the variable window, wherein the encoding can be decoded to generate the individual test variables for the variables of interest. Examples of training and test variable value prediction are shown in FIG. 7A and FIG. 7B. In another example, the model can predict the loci values for the test variables based on the flanking (observed) loci values within the second subset of variables (e.g., using a deep learning network, a generative model, an autoencoder, etc.). In another example, the model can be a CNN trained to predict the phenotype value based on the variable values, wherein the CNN can implicitly learn the causal variables and/or features (e.g., intermediate variables). The causal variables can optionally be explicitly determined from the CNN (e.g., using explainability methods, such as SHAP values, lift, coefficient analysis, etc.), and/or not explicitly determined (e.g., wherein the CNN is used to determine the phenotype value as-is).

In a third embodiment, the variable-variable association model transforms variables to a reduced dimension space (e.g., latent space). For example, the variable-variable association model can compress (e.g., embed, reduce, etc.) the set of variables into a set of features, wherein the set of features is smaller than the set of variables (e.g., illustrative example shown in FIG. 8). Transformed observed variables (e.g., observed features) and transformed test variables (e.g., test features) can optionally be treated as observed variables and test variables, respectively, in all or parts of the method. For example, a phenotype-variable association model can include a relationship between a phenotype and transformed observed variables (e.g., the observed embedded variable, the observed feature, etc.) and/or transformed test variables (e.g., the test embedded variable, the test feature, etc.; wherein the test feature preserves the observed feature's distribution), wherein transformed causal variables (e.g., causal embedded variable, causal feature, etc.) can be identified using the phenotype-variable association model (e.g., using the features or embedded variables as the phenotype-variable association model's independent variables) and decoded (using the variable-variable association model) to determine the causal variables. An example is shown in FIG. 8.

In a first example, the variable-variable association model can be an autoencoder that is trained (e.g., trained using a different subset of variables than the subset associated with the variables of interest) to compress multiple observed variables into an encoding that can function as a transformed observed variable. In a second example, a first layer of a neural network (e.g., a phenotype-variable association model) can function as the variable-variable association model, wherein the first layer (e.g., a pooling layer) transforms observed variables into transformed observed variables. In a third example, principal component analysis and/or any other dimensionality reduction technique can be used to compress variables into transformed variables. A transformed test variable can optionally be determined based on the transformed observed variables. In examples, the transformed test variable can be determined: using a different variable-variable association model (e.g., including a relationship between transformed observed variables and the transformed test variable); by selecting (e.g., randomly selecting) transformed test values from a distribution of transformed observed values; and/or using any other test variable generation methods (e.g., as previously described).

In any embodiment, multiple instances of the variable-variable association model can optionally be determined (e.g., multiple instances separately trained, a single trained variable-variable association model with a different randomization parameter for each instance, etc.), wherein a test variable can be determined for the variable(s) of interest using each model instance.

Determining test variables can optionally include a test variable check, wherein a test variable does not pass if the test variable: results in a (substantial) deviation from the joint distribution between observed variables, results in a (substantial) deviation in the distribution of test values within the test variable relative to the distribution of observed values in the corresponding observed variable, and/or otherwise deviates from allowable criteria. Test variables that do not pass the test variable check can be adjusted, discarded, and/or otherwise processed.
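A check on the marginal distribution could be sketched as below for discrete values such as genotype dosages. The total variation distance and the 0.1 cutoff are assumptions chosen for illustration; the description does not specify a particular deviation measure, and this sketch does not cover the joint-distribution check.

```python
from collections import Counter


def total_variation_distance(a, b):
    """Half the summed absolute difference between two empirical distributions."""
    ca, cb = Counter(a), Counter(b)
    values = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[v] / len(a) - cb[v] / len(b)) for v in values)


def passes_marginal_check(observed_values, test_values, max_distance=0.1):
    """True if the test values' distribution stays close to the observed one."""
    return total_variation_distance(observed_values, test_values) <= max_distance


observed_values = [0, 1, 2, 1, 0, 2]
shuffled_values = [2, 1, 0, 0, 1, 2]    # same genotype counts, different order
constant_values = [0, 0, 0, 0, 0, 0]    # information removed, distribution lost

ok = passes_marginal_check(observed_values, shuffled_values)
rejected = not passes_marginal_check(observed_values, constant_values)
```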

The test variables can optionally be aggregated into a test data set (e.g., of the same form as the observed data set, of a modified form, etc.). In a first variant, the test data set contains test variables (e.g., with no observed variables) for each of the set of variables (e.g., the subset of variables). In a second variant, the test data set is the observed data set with one or more observed variables replaced with the corresponding test variables (e.g., associated with the same variables). The observed variables that are replaced can be associated with variables of interest (e.g., variables to be tested for a phenotype association). In an example, the test data set can be the observed data set with a single observed variable exchanged with its corresponding test variable (e.g., when the set of target genomic components is a single genomic component). Examples are shown in FIG. 11B and FIG. 11C.
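The second variant, in which selected observed variables are swapped for their test variables, can be sketched with the data set held as a mapping from variable name to per-organism values. That representation, the function name `make_test_data_set`, and the example values are assumptions for illustration.

```python
def make_test_data_set(observed_data, test_variables):
    """Copy the observed data set, replacing the named variables with test variables.

    Both arguments map a variable name to its per-organism values. The
    observed data set is left unmodified.
    """
    unknown = set(test_variables) - set(observed_data)
    if unknown:
        raise KeyError(f"no observed variable for: {sorted(unknown)}")
    test_data = {name: list(values) for name, values in observed_data.items()}
    for name, values in test_variables.items():
        if len(values) != len(observed_data[name]):
            raise ValueError(f"{name}: test variable must cover every organism")
        test_data[name] = list(values)
    return test_data


observed_data = {"locus_1": [0, 1, 2], "locus_2": [1, 1, 0], "locus_3": [2, 0, 1]}
# Exchange a single observed variable with its (hypothetical) test variable.
test_data = make_test_data_set(observed_data, {"locus_2": [0.5, 1.0, 0.5]})
```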

However, test variables can be otherwise determined.

Determining a phenotype-variable association model S500 functions to determine a model relating the phenotype to the set of variables (e.g., where the model predicts phenotype values given variable values). S500 can be performed after S300, before S300, after S100, after S200, and/or at any other time. The phenotype-variable association model can be determined: once (e.g., based on the observed data), multiple times (e.g., once for each variable of interest), and/or any other number of times.

The phenotype-variable association model can be: for a specific phenotype, for a phenotype set, and/or any other suitable combination of phenotypes. The phenotype-variable association model can be determined based on the observed values (e.g., from S100), based on test values (e.g., from S450), and/or based on any other set of data. The phenotype-variable association model preferably does not model inter-variable interactions, but alternatively can model inter-variable interactions. The phenotype-variable association model can be: selected, learned, fit, or otherwise determined. Inputs to the phenotype-variable association model can include: variable values (e.g., observed variables including observed variable values, test variables including test variable values, etc.) and/or any other suitable inputs. Outputs from the phenotype-variable association model can include: phenotype values (e.g., a phenotype including predicted phenotype values) and/or any other suitable outputs.

In a first variant, the phenotype-variable association model is a neural network trained to predict a phenotype (e.g., including a vector of phenotype values) based on the set of variables (e.g., including vectors of variable values). The phenotype-variable association model is preferably trained using observed variables and observed phenotypes for the population of organisms, but can additionally or alternatively be trained using any other phenotypes and/or variables.

In a second variant, the phenotype-variable association model is a regression. For example, S500 can include determining (e.g., calculating, fitting, etc.) a regression between a phenotype (e.g., the dependent variable) and variables (e.g., the independent variables). In this example, the variable values can include only observed variable values, only test variable values, and/or a combination of observed and test variable values. In a first example, S500 can include: determining an observed variable for each of the set of variables (e.g., S100), determining an observed phenotype (e.g., S100), and fitting the phenotype-variable association model based on the observed variables and the observed phenotype. In a second example, S500 can include: determining a test variable for each of the set of variables (e.g., S450), determining an observed phenotype (e.g., S100), and fitting the phenotype-variable association model based on the test variables and the observed phenotype. In a third example, S500 can include: determining an observed variable for each of the set of variables (e.g., S100), determining a test variable for the variables of interest and/or for each of the set of variables (e.g., S450), determining an observed phenotype (e.g., S100), and fitting the phenotype-variable association model based on the observed variables, the test variables, and the observed phenotype.

The phenotype-variable association model can use the variable values for: all variable types (e.g., genomic component, DNA methylation, gene expression, environmental variables, transcriptome variables, etc.), a subset of variable types (e.g., only genomic component, only genomic component and gene expression, etc.), and/or any other combination of variable types. In a first example, the model predicts a value for a single phenotype based on values for a set of genomic component variables (e.g., genotypes). In a second example, the model predicts a value for a first phenotype and a value for a second phenotype based on values for genomic component variables, environmental variables, gene expression variables, and/or DNA methylation variables. In a third example, a first instance of the model predicts a phenotype value based on values for a set of genomic component variables, a second instance of the model predicts a phenotype value (e.g., for the same phenotype) based on values for a set of environmental variables, and a third instance of the model predicts a phenotype value based on values for a set of gene expression variables.

However, the phenotype-variable association model can be otherwise determined.

Identifying causal variables associated with a phenotype S600 functions to reduce the variable dimensionality and/or identify variables that influence trait expression. S600 can be performed after S300, after S500, and/or at any other time.

Causal variables can be a subset of the set of variables, wherein a set of causal variables can be selected for: a set of phenotypes (e.g., target phenotypes), an individual phenotype, and/or can be otherwise selected. The causal variables can be selected based on observed variables, test variables, and observed phenotypes (e.g., observed variable values, test variable values, and observed phenotype values for each organism in a population). However, the causal variables can be selected based on any other variable values and/or phenotype values.

The causal variables can be selected from the set of variables: manually, using a phenotype-variable association model, randomly, and/or otherwise selected. Selecting the causal variables using a phenotype-variable association model can include: determining an association metric for each variable based on the phenotype-variable association model; and selecting the causal variables from the set of variables based on the respective association metric for each variable.

Determining an association metric for each variable based on the phenotype-variable association model can function to extract information on the relationship between each variable and one or more phenotypes. Association metrics for different variables can be independently determined or, alternatively, can be concurrently determined. Multiple variables can optionally be associated with the same association metric (e.g., example shown in FIG. 17).

Determining the association metric for a variable of interest preferably includes determining a model metric for the phenotype-variable association model with and without the information for the variable of interest (e.g., an observed model metric and a test model metric, respectively), and determining the association metric based on a comparison between the model metrics. The comparison can include a difference, a ratio, a statistical measure, a distance metric, an aggregate of comparisons, an absolute value thereof, and/or any other comparison. Examples of model metrics include: the variable weight (e.g., a coefficient in the model), the model's phenotype prediction, the model's loss, the model's variance (e.g., coefficient of determination), a model fit metric (e.g., R-squared, RMSE, etc.), log-likelihood evaluation, a variable classification (e.g., causal or non-causal variable), a model classification (e.g., predictive or non-predictive), statistical measure, summary statistics, and/or any other value determined based on the phenotype-variable association model. Examples are shown in FIG. 12, FIG. 13, FIG. 14, FIG. 15, FIG. 20, and FIG. 21. Alternatively, the association metric can be determined based on a single model metric value (e.g., a measure of association between the variable of interest and a phenotype). In an illustrative example, the association metric can be a coefficient for the variable of interest in the phenotype-variable association model (e.g., where test variable values are not used in the model).

In a first example, a single instance of the phenotype-variable association model is used to determine the observed model metric and test model metric. In a specific example, the phenotype-variable association model includes both a test variable for the variable of interest and an observed variable for the variable of interest (e.g., includes a single test variable for the variable of interest and an observed variable for each of the set of variables; includes a test variable and an observed variable for each of the set of variables; etc.), wherein a test model metric (e.g., a variable weight) can be determined for the test variable and an observed model metric can be determined for the observed variable. The association metric can be determined based on a comparison between the test model metric and the observed model metric. An example is shown in FIG. 17.

In a second example, two instances of the phenotype-variable association model are used to determine the observed model metric and test model metric. In a specific example, a test phenotype-variable association model includes a test variable for the variable of interest and an observed phenotype-variable association model includes an observed variable for the variable of interest (e.g., includes only observed variables for each of the set of variables), wherein a test model metric (e.g., a variable weight, a model loss, etc.) can be determined based on the test phenotype-variable association model and an observed model metric can be determined based on the observed phenotype-variable association model. The association metric can be determined based on a comparison between the test model metric and the observed model metric. In examples, the test phenotype-variable association model can include, in addition to the test variable for the variable of interest: observed variables for all or a subset of the set of variables (e.g., the entire set of variables except the variable of interest); test variables for each of the set of variables (e.g., without observed variables); and/or any other combination of observed and/or test variables. Examples are shown in FIG. 13 and FIG. 15.

In a third example, more than two instances of the phenotype-variable association model are used to determine the observed model metric and test model metric. In a specific example, an observed phenotype-variable association model includes an observed variable for the variable of interest (e.g., includes only observed variables for each of the set of variables), and each of a set of test phenotype-variable association models includes a test variable instance for the variable of interest. An observed model metric (e.g., a variable weight, a model loss, etc.) can be determined based on the observed phenotype-variable association model, and a test model metric can be determined for each test phenotype-variable association model, wherein the association metric can be based on an aggregate of comparisons between the observed model metric and each test model metric. In an illustrative example, each comparison includes a binary value (e.g., 0 corresponds to a test model metric greater than or equal to the observed model metric; 1 corresponds to a test model metric less than the observed model metric), wherein the aggregate of comparisons (e.g., average) represents a statistical measure (e.g., p-value, probability, etc.) that the test model metric is greater than the observed model metric. An example is shown in FIG. 16.
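The aggregate of binary comparisons described in the illustrative example can be sketched as follows. This is a minimal sketch, not the claimed implementation: the function name and the metric values are hypothetical, and the model metric is assumed to be a single number per model (e.g., R-squared). The code only computes the average of the binary values, which is the fraction of test models whose metric is less than the observed metric.

```python
def aggregate_comparisons(observed_metric, test_metrics):
    """Average the binary comparisons between each test metric and the observed metric."""
    # Binary value per comparison: 1 when the test model metric is less than
    # the observed model metric, 0 when it is greater than or equal to it.
    indicators = [1 if t < observed_metric else 0 for t in test_metrics]
    return sum(indicators) / len(indicators)


# Hypothetical metric values (e.g., R-squared) for one variable of interest.
observed_r2 = 0.82
test_r2_values = [0.61, 0.58, 0.83, 0.60, 0.57]

fraction_below = aggregate_comparisons(observed_r2, test_r2_values)
print(fraction_below)
```

One minus the returned value is the fraction of test models whose metric is greater than or equal to the observed metric.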

In a fourth example, the association metric can be determined based on a test summary statistic (e.g., for the variable of interest). The test summary statistic can be determined based on the neighboring summary statistics (e.g., for variables in the variable window), based on the neighboring variable values (e.g., for the variables in the variable window), and/or based on any other suitable information. Test summary statistics can be generated in the same and/or similar manner as generating test variables, and/or otherwise determined. In a first specific example, this can include: fitting an observed phenotype-variable association model using observed values; fitting a metric-variable association model that treats the summary statistic of the variable of interest (e.g., the variable of interest's weight from the observed phenotype-variable association model) as the dependent variable and the summary statistics of the neighboring variables as the independent variables; and determining the test summary statistic based on the metric-variable association model (e.g., by calculating the test summary statistic value using the neighboring summary statistics; by calculating the test summary statistic value without using observed variable values; by calculating the test summary statistic value using the observed neighboring variable values; etc.). An example is shown in FIG. 20.

In a second specific example, this can include: fitting an observed phenotype-variable association model using observed values; fitting a metric-variable association model that treats the summary statistic of the variable of interest (e.g., the variable of interest's weight from the observed phenotype-variable association model) as the dependent variable and the neighboring variables as the independent variables; and determining the test summary statistic based on the metric-variable association model (e.g., by calculating the test summary statistic value using the observed neighboring variable values, etc.). An example is shown in FIG. 21.

In a first variant, association metrics are determined using a phenotype-variable association model that includes a regression.

In a first embodiment, determining association metrics using a regression includes calculating a regression based on the observed variables and their corresponding test variables:

$$P \sim a_1 V_1 + a_2 V_2 + \dots + a_n V_n + b_1 T_1 + b_2 T_2 + \dots + b_n T_n$$

where P is the observed phenotype variable, V_i is the observed variable for variable i, T_i is the test variable for variable i, and a_i and b_i are the observed and test coefficients for the observed and test variables for variable i, respectively. In this embodiment, the coefficients are the model metrics (e.g., the observed and test coefficients are the observed and test model metrics, respectively), wherein the comparison between the coefficients is the association metric used to identify the causal variables associated with the phenotype. In an example, the difference between the observed and test coefficients for each variable i (e.g., a_i − b_i) is the association metric for the respective variable.

In a second embodiment, determining association metrics using a regression includes calculating individual regressions for each variable (x) from a set of variables (1 to n):

$$P \sim a_1 V_1 + \dots + a_x T_x + \dots + a_n V_n$$

where P is the observed phenotype variable, V_i is the observed variable for variable i, T_x is the test variable for variable x, and a_i is the coefficient for variable i. In an example, the variation (e.g., R²) for the regressions can be used as the model metric. In a specific example, the observed variation (e.g., R²) for the observed regression (e.g., from S200) and the test variation (e.g., R²_T) from each individual regression can be calculated and compared to determine the association metric for each variable.
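The comparison of R² between the observed regression and each individual regression can be sketched as follows. This is a minimal sketch under stated assumptions: the data values are made up, the test-variable values are hard-coded stand-ins for values that would come from the test variable generation step, the helper names `solve` and `r_squared` are hypothetical, and ordinary least squares with an intercept is assumed as the regression. The association metric is taken here as the difference R² − R²_T, which is one of the comparisons the description allows.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """Fit y ~ intercept + columns by least squares and return R-squared."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up observed variables for 8 organisms; the phenotype depends on V1 only.
v1 = [0, 1, 2, 0, 1, 2, 0, 1]
v2 = [1, 0, 1, 0, 1, 0, 1, 0]
phenotype = [0.1, 2.0, 3.9, -0.1, 2.1, 4.0, 0.0, 1.9]
# Hypothetical test-variable values (stand-ins for generated test variables).
t1 = [1, 2, 0, 2, 0, 1, 0, 1]
t2 = [0, 0, 1, 1, 0, 1, 1, 0]

observed = [v1, v2]
tests = [t1, t2]
r2_observed = r_squared(observed, phenotype)

metrics = []
for x in range(len(observed)):
    # Individual regression x: replace only variable x with its test variable.
    columns = observed[:x] + [tests[x]] + observed[x + 1:]
    metrics.append(r2_observed - r_squared(columns, phenotype))

print(metrics)  # a large drop for variable 1, a near-zero change for variable 2
```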

However, regression-based phenotype-variable association models can be otherwise used to determine association metrics.

In a second variant, association metrics are determined using a phenotype-variable association model that includes a machine learning model (e.g., a neural network, Bayesian model, SVM, etc.).

In a first embodiment, determining association metrics using a machine learning model includes: training an observed phenotype-variable association model to predict the observed phenotype values based on the observed variable values; constructing a test data set having test values (e.g., determined in S450) for one or more variables of interest; and training a test phenotype-variable association model to predict the observed phenotype values based on the test data set. The observed and test phenotype-variable association models preferably have the same base model but can alternatively have different base models. In an example, a model performance metric (e.g., loss, accuracy, etc.) can be used as the model metric, wherein the association metric for the variable(s) of interest can be a comparison between the model metric for the observed phenotype-variable association model and the model metric for the test phenotype-variable association model.

In a second embodiment, determining association metrics using a machine learning model includes: training a phenotype-variable association model to predict the observed phenotype(s) based on the observed variables; generating test variables by replacing the observed variable value for each organism's genotype with a test value (e.g., S450); predicting the phenotype value using the test variables (e.g., replacing one or more observed variables with test variables as inputs to the phenotype-variable association model); and calculating the model performance for the prediction. In an example, a model performance metric can be used as the model metric, wherein the association metric can be determined based on a comparison between model performance using (only) the observed variables, and model performance when one or more observed variables are replaced with test variables.
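The replace-and-compare procedure of this embodiment can be sketched as follows. This is an illustrative sketch only: `predict` is a hypothetical stand-in for an already trained phenotype-variable association model, the data values are made up, mean squared error is assumed as the model performance metric, and the association metric is assumed to be the increase in error when one observed variable is replaced with its test variable.

```python
def mse(predict, rows, phenotypes):
    """Mean squared error of the model's predictions over all organisms."""
    return sum((predict(r) - p) ** 2 for r, p in zip(rows, phenotypes)) / len(rows)


def association_metrics(predict, observed_rows, test_rows, phenotypes):
    """Increase in loss when each variable's observed values are replaced by test values."""
    base = mse(predict, observed_rows, phenotypes)
    metrics = []
    for j in range(len(observed_rows[0])):
        # Replace only variable j with its test value for every organism.
        swapped = [obs[:j] + [tst[j]] + obs[j + 1:]
                   for obs, tst in zip(observed_rows, test_rows)]
        metrics.append(mse(predict, swapped, phenotypes) - base)
    return metrics


def predict(row):
    # Hypothetical stand-in for a trained model: only variable 0 matters.
    return 2.0 * row[0] + 0.0 * row[1]


observed_rows = [[0, 1], [1, 0], [2, 1], [1, 1]]   # made-up observed variable values
test_rows = [[1, 0], [1, 1], [1, 0], [1, 0]]       # made-up test variable values
phenotypes = [0.0, 2.0, 4.0, 2.0]

metrics = association_metrics(predict, observed_rows, test_rows, phenotypes)
print(metrics)
```

Because the model is not retrained, this embodiment needs only one trained model regardless of the number of variables of interest.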

In a third embodiment, determining association metrics using a machine learning model includes training a model to predict the phenotype values using the observed variable values; treating each variable as a feature; and using feature selection and/or explainability methods (e.g., local interpretable model-agnostic explanations, Shapley additive explanations, partial dependence plots, etc.) to determine the association metrics (e.g., indicating how influential a variable is).

However, association metrics can be otherwise determined.

Selecting the causal variables from the set of variables based on the respective association metric for each variable can include selecting: variables with nonzero (e.g., positive and negative; valence agnostic) association metrics, variables with association metrics satisfying a threshold condition (e.g., absolute value above or below a threshold; above a first positive threshold or below a second (negative) threshold; etc.), a predetermined number and/or percent of variables with the largest positive association metric values, a predetermined number and/or percent of variables with the largest negative association metric values, a predetermined number and/or percent of variables with the largest absolute association metric values, variables with association metrics satisfying a statistical measure condition (e.g., variables with association metrics that are outliers, variables with association metrics that are at least a threshold standard deviation above/below the mean, etc.), a combination thereof, and/or any other variable subset. In a first example, the variables can be ranked based on their association metrics, wherein the selected causal variables can be the top m variables. In a specific example, m can be between 1 and 10,000 or any range or value therebetween (e.g., 10-1,000), but can alternatively be less than 1 or greater than 10,000. In a second example, the variables can be ranked based on their association metrics, wherein the selected causal variables can be the variables n standard deviations from the mean. In a specific example, n can be between 1 and 5 or any range or value therebetween (e.g., 2, 3, 4, etc.), but can alternatively be less than 1 or greater than 5. Examples are shown in FIG. 18A and FIG. 18B. In a third example, variables with the top percentiles of association metric values (e.g., top 20%, 15%, 10%, 5%, 2%, etc. of variables) can be selected as the causal variables. The association metric value can be valence agnostic, or account for valence. Additionally or alternatively, all variables of a specific type and/or classification can be selected (e.g., all environmental variables). However, causal variables can be otherwise selected from the set of variables.
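Two of the selection rules above (top m variables, and variables at least n standard deviations from the mean) can be sketched as follows. This is a minimal sketch: the variable names and metric values are hypothetical, the top-m ranking is assumed to be valence agnostic (absolute values), and the population standard deviation is assumed for the second rule.

```python
from statistics import mean, pstdev


def top_m(metrics, m):
    """Rank variables by absolute association metric and keep the top m."""
    ranked = sorted(metrics, key=lambda name: abs(metrics[name]), reverse=True)
    return ranked[:m]


def beyond_n_std(metrics, n):
    """Keep variables whose metric is at least n standard deviations from the mean."""
    values = list(metrics.values())
    mu, sigma = mean(values), pstdev(values)
    return [name for name, v in metrics.items() if abs(v - mu) >= n * sigma]


# Hypothetical association metrics for six variables.
metrics = {
    "snp_a": 0.02, "snp_b": 0.91, "snp_c": -0.05,
    "snp_d": 0.01, "snp_e": 0.03, "snp_f": -0.01,
}

print(top_m(metrics, 2))
print(beyond_n_std(metrics, 2))
```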

The causal variables can optionally be a superset of multiple causal variable sets. For example, S600 can be repeated with a different model for each phenotype in a set, wherein different causal variable sets are selected for each phenotype and then aggregated to generate a superset of causal variables (e.g., example shown in FIG. 19).

However, causal variables can be otherwise identified.

The method can optionally include determining breeding parameters to achieve a target causal variable value set S700, which functions to determine an organism that will exhibit a target phenotype. The organism can subsequently be bred using the breeding parameters; however, the breeding parameters can be otherwise used. All and/or portions of S700 can be performed after S600, S710 independent of S100-S600, before and/or after any of S100-S600, and/or at any other time. S700 or components thereof can be performed: once, iteratively until a stop condition is met (e.g., for a predetermined number of iterations, until a marginal improvement in the predicted phenotype value set falls below a threshold for a predetermined number of iterations, until the predicted phenotype value matches the target phenotype value, etc.), and/or any number of times.

As shown in FIG. 22, determining breeding parameters S700 can include: determining a phenotype model using the causal variables S710 (selected in S600), optionally determining a target causal variable value set S730, and determining breeding parameters to achieve the target causal variable value set S750. Alternatively, the breeding parameters can be determined experimentally (e.g., by growing the organisms, etc.), and/or be otherwise determined.

Determining a phenotype model using the causal variables S710 functions to determine a phenotype model (e.g., updated phenotype-variable association model) with a reduced number of variable inputs relative to the phenotype-variable association model used for causal variable selection (in S600). S710 can be performed after S600 and/or at any other time. S710 or components thereof can be performed: once, iteratively until a stop condition is met (e.g., for a predetermined number of iterations, until a marginal improvement in the predicted phenotype value set falls below a threshold for a predetermined number of iterations, until the predicted phenotype value matches the target phenotype value, etc.). A different phenotype model is preferably determined for each phenotype set (e.g., trait, trait value, etc.); alternatively, a phenotype model for one phenotype set can be used for other phenotype sets.

The causal variables can be determined using one or more of S100-S600, be manually determined, be predicted, be learned (e.g., by a neural network trained to predict a set of phenotype values based on a set of variable values), be randomly selected, and/or be otherwise determined.

The phenotype model is preferably a model trained to determine (e.g., predict) one or more phenotype values for an organism, given the organism's causal variable values. The phenotype model is preferably the same model class as the phenotype-variable association model used in S600, but can alternatively be a different model class (e.g., a low-dimension version) and/or otherwise configured. The phenotype model can be and/or include: a regression, support vector machine, classifier, random forest, kernel methods, generative model, clustering model, Bayesian model (e.g., HMM), neural network (e.g., CNNs, DNNs, etc.), equation, probability, deterministics, genetic program, and/or any model that could fit the biology. The phenotype model can be a linear model, nonlinear model (e.g., regression, neural network), and/or other model. The phenotype model can be learned (e.g., using supervised learning, unsupervised learning, etc.), fit, trained, predetermined, and/or can be otherwise determined.

The phenotype model can be determined (e.g., trained) based on one or more phenotypes and one or more causal variables (e.g., one phenotype value and one set of causal variable values for each of a set of organisms). Phenotype values and causal variable values are preferably observed (e.g., experimentally derived values, the same values as those determined in S100, etc.), but can alternatively be synthetic values (e.g., to simulate variables outside an observed variable distribution). The phenotype model can include more, fewer, or the same variable types as the phenotype-variable association model used in S600. The phenotype model preferably models inter-variable interactions, but alternatively can ignore inter-variable interactions.

In a first variant, the phenotype model is a neural network trained to predict phenotype values (e.g., trait values) given known values (e.g., observed values, predicted values, etc.) for the causal variables for one or more organisms.

In a second variant, the phenotype model is a regression fit to the observed phenotype values and the observed causal variable values, wherein the phenotype values are treated as the dependent variables and the causal variable values are treated as the independent variables.

In a third variant, the phenotype model is a machine learning model (e.g., neural network) trained to predict the phenotype value(s) based on the parent organisms' variable values (e.g., the parent organisms' causal variable values).

In a first example, the phenotype model predicts a value for a single phenotype based on values for causal genomic component variables (e.g., genotypes), causal environmental variables, causal gene expression variables, and causal DNA methylation variables. In this example, the set of causal variables (determined in S600) includes causal genomic component variables, causal environmental variables, causal gene expression variables, and causal DNA methylation variables.

In a second example, the phenotype model predicts a value for a plurality of phenotypes (e.g., trait values), wherein the phenotype model's variables can be determined from the causal variables for each of the respective phenotypes (e.g., include all causal variables for all of the phenotypes, the intersection of the causal variable sets, any other suitable combination of the respective causal variable sets, etc.). The causal variables for each phenotype are preferably independently determined (e.g., using different instances of S100-S600; using different phenotype-variable association models; etc.), but can alternatively be determined together (e.g., using the same instance of S100-S600; using the same phenotype-variable association model; etc.). For example, the causal variables can be a superset of multiple causal variable sets. In this example, S600 can be repeated with a different model for each phenotype in a set, wherein different causal variable sets are selected for each phenotype and then aggregated to generate a superset of causal variables (e.g., example shown in FIG. 19).

In an illustrative example, the phenotype model predicts a value for a first phenotype and a value for a second phenotype based on values for causal genomic component variables. In a specific example, the causal genomic component variables are a superset of (e.g., include both of) the causal variable sets selected in S600 for the first and second phenotypes.

In a third example, the phenotype model predicts a value for a phenotype based on values for causal genomic component variables conditioned on a set of covariates, wherein the set of covariates includes causal environmental variables, causal gene expression variables, causal DNA methylation variables, and/or any other covariates. In a specific example, the phenotype model predicts a value for a phenotype based on values for causal genomic component variables conditioned on a first set of covariates, wherein the conditioned causal genomic component variables are subsequently conditioned on a second set of covariates (e.g., and optionally iteratively conditioned on any number of covariate sets).

In a fourth example, the phenotype model predicts a value for a phenotype based on values for causal transcriptome variables, and optionally values for causal genomic component variables, causal environmental variables, and/or causal gene expression variables. Examples are shown in FIG. 23A, FIG. 23B, and FIG. 23C.

However, the phenotype model can be otherwise determined.

Determining a target causal variable value set S730 functions to identify values for the causal variables that will produce a set of target values for the phenotype(s). S730 can be performed after S710, with S750, and/or at any other time. S730 or components thereof can be performed: once, iteratively until a stop condition is met (e.g., for a predetermined number of iterations, until a marginal improvement in the predicted phenotype value set falls below a threshold for a predetermined number of iterations, until the predicted phenotype value matches the target phenotype value, etc.).

In a first variant, the target causal variable value set is selected from one or more candidate causal variable value sets based on phenotype values predicted by the phenotype model. This variant can include: generating candidate causal variable value sets S731, predicting a phenotype value for each candidate causal variable value set using the phenotype model S733, and/or selecting the target causal variable value set from the candidate causal variable value sets S735. In variants, this can be iteratively performed until the selected causal variable values satisfy a condition (e.g., quality condition, statistical condition, etc.).

Generating the candidate causal variable value sets S731 functions to create candidate sets of causal variable values. Candidate values for each causal variable can be: predetermined (e.g., manually determined, specified by growing conditions, etc.), randomly determined (e.g., which can avoid local minima), computed (e.g., based on candidate parent variable values, using predictive breeding, using environmental forecasting models, etc.), optimized (e.g., to maximize the predicted trait value, to minimize the predicted trait value, etc.), observed, a combination thereof, and/or otherwise determined. In a first example, candidate causal variable values can be determined for a fixed growing environment, wherein environmental variable values are held constant while other causal variables are permuted (e.g., randomly permuted). In a second example, causal variable values can be selected to optimize a set of causal variables (e.g., to maximize the predicted phenotype value, to minimize the predicted phenotype value, etc.). In a third example, all causal variables are randomly permuted to generate candidate causal variable value sets. In a fourth example, causal variable values can be determined by virtually crossing sets of candidate parent organisms (e.g., with observed and/or known variable values) and determining the values for the causal variables from the one or more virtual children. However, the candidate causal variable values can be otherwise determined.
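The first example (a fixed growing environment with the other causal variables randomly permuted) can be sketched as follows. This is one hypothetical reading of "randomly permuted": each non-environmental causal variable independently takes a value drawn from the values observed in the population. The function name, variable names, and data are all assumptions made for illustration.

```python
import random


def generate_candidates(observed_sets, fixed_values, n_candidates, seed=0):
    """Create candidate causal variable value sets with fixed environmental values."""
    rng = random.Random(seed)
    # Variables not held constant are free to vary between candidates.
    free_names = [name for name in observed_sets[0] if name not in fixed_values]
    candidates = []
    for _ in range(n_candidates):
        candidate = dict(fixed_values)  # environmental values are held constant
        for name in free_names:
            # Draw each free variable from the values observed in the population.
            candidate[name] = rng.choice([s[name] for s in observed_sets])
        candidates.append(candidate)
    return candidates


# Made-up observed causal variable value sets for three organisms.
observed_sets = [
    {"snp_a": 0, "snp_b": 2, "temperature_c": 21.0},
    {"snp_a": 1, "snp_b": 0, "temperature_c": 26.0},
    {"snp_a": 2, "snp_b": 1, "temperature_c": 24.0},
]

candidates = generate_candidates(observed_sets, {"temperature_c": 24.0}, n_candidates=5)
for candidate in candidates:
    print(candidate)
```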

Predicting a phenotype value for each candidate causal variable value set using the phenotype model S733 can include predicting one or more phenotype values using the phenotype model (determined in S710) given a candidate causal variable value set as input. S733 can be performed for every candidate causal variable value set, a subset of the candidate causal variable value sets (e.g., the most common or most frequently occurring candidate causal variable value sets, a random sample of the candidate causal variable value sets, etc.), and/or for any other suitable set of candidate causal variable values.

S733 can optionally include determining a breeding value for an organism (characterized by a candidate causal variable value set) based on the organism's predicted phenotype values. For example, the breeding value can be determined using: EBV = b × (P_individual − P_avg), where EBV is the estimated breeding value, b is heritability, P_individual is the predicted phenotype value for the candidate causal variable value set, and P_avg is the average phenotype value (e.g., predicted and/or observed) for a group of causal variable value sets corresponding to an organism population. Heritability can be determined empirically, via breeding simulations, and/or can be otherwise determined. The estimated breeding value is preferably defined such that EBV increases as P_individual approaches a target phenotype value set, but alternatively can be otherwise defined.
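The estimated breeding value formula can be illustrated with a short worked example. The function name and the numeric values are hypothetical; the heritability of 0.5 is an arbitrary illustrative value, not one given in the description.

```python
def estimated_breeding_value(heritability, predicted_value, population_values):
    """EBV = b * (P_individual - P_avg) for one candidate causal variable value set."""
    population_mean = sum(population_values) / len(population_values)
    return heritability * (predicted_value - population_mean)


# Made-up phenotype values (e.g., harvest weight) for the organism population.
population_values = [8.0, 10.0, 12.0]

ebv = estimated_breeding_value(
    heritability=0.5,            # assumed value of b, for illustration only
    predicted_value=12.0,        # P_individual predicted by the phenotype model
    population_values=population_values,
)
print(ebv)
```

With a population average of 10.0, the candidate sits 2.0 above average, and half of that difference is attributed to the breeding value.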

S733 can optionally include determining a confidence score for the phenotype value prediction. In a first embodiment, the confidence score is determined based on a calculated loss between predicted phenotype values and observed phenotype values (e.g., for a single organism, averaged across organisms in a population, etc.). In a second embodiment, the confidence score is output by the phenotype model itself (e.g., where S710 includes training the model to output the confidence score). However, the confidence score can be otherwise determined.

Selecting the target causal variable value set (TCVVS) from the candidate causal variable value sets (CCVVS) S735 functions to determine which causal variable value set to use (e.g., determine the target organism to breed). The target causal variable value set can be selected based on: predicted phenotype values (e.g., a comparison between predicted and target phenotype values), confidence scores, probability of occurrence (e.g., for a given parent organism set, for a given population, for a given population of parent organisms, etc.), breeding values, breeding parameters (e.g., applying a lower weighting to a candidate causal variable value set that requires more breeding generations; applying a lower weighting to a candidate causal variable value set that prescribes more expensive treatments or growing conditions; etc.), a combination thereof, and/or any other information.

In a first example, the CCVVS associated with the most performant phenotype value set is selected as the TCVVS (e.g., example shown in FIG. 24A). The most performant phenotype value set can be the set that matches the exact values in a target phenotype value set, the set with values closest to the values in the target phenotype value set, and/or can be otherwise defined. In a second example, the TCVVS can be selected from the CCVVSs based on the respective predicted phenotype value set and associated confidence score (e.g., where a CCVVS with a low confidence score is less likely to be selected). In a third example, the CCVVS that satisfies a target condition is selected as the TCVVS (e.g., example shown in FIG. 24B). The target condition can be when the predicted phenotype value set is within a threshold of the target phenotype value set, when the estimated breeding value is above a threshold, and/or any other condition. In a fourth example, the TCVVS can be selected based on a distribution. For example, the TCVVS can be selected based on the CCVVS distribution, wherein the TCVVS is the highest-frequency CCVVS, a set of CCVVS within a predetermined number of standard deviations from the mean, and/or otherwise selected based on the CCVVS distribution. In another example, the TCVVS can be selected based on the distribution of phenotype values generated using the CCVVS (e.g., from S733), wherein the CCVVS generating the best combination of phenotype values (e.g., highest days to flower and harvest weight) is selected as the TCVVS.
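The first and third selection examples can be sketched as follows. This is a minimal sketch under stated assumptions: the candidate identifiers and predicted phenotype values are made up, "closest" is assumed to mean smallest Euclidean distance between predicted and target phenotype value sets, and the threshold is an arbitrary illustrative value.

```python
def distance(predicted, target):
    """Euclidean distance between a predicted and a target phenotype value set."""
    return sum((predicted[k] - target[k]) ** 2 for k in target) ** 0.5


def select_closest(candidates, target):
    """First example: pick the candidate whose predicted phenotypes are closest to target."""
    return min(candidates, key=lambda c: distance(c["predicted"], target))


def select_within_threshold(candidates, target, threshold):
    """Third example: keep every candidate whose prediction is within a threshold of target."""
    return [c for c in candidates if distance(c["predicted"], target) <= threshold]


target = {"days_to_flower": 60, "harvest_weight": 100}
# Hypothetical candidates with phenotype values already predicted (e.g., in S733).
candidates = [
    {"id": "A", "predicted": {"days_to_flower": 62, "harvest_weight": 95}},
    {"id": "B", "predicted": {"days_to_flower": 70, "harvest_weight": 101}},
    {"id": "C", "predicted": {"days_to_flower": 59, "harvest_weight": 98}},
]

closest = select_closest(candidates, target)
acceptable = select_within_threshold(candidates, target, threshold=6.0)
print(closest["id"], [c["id"] for c in acceptable])
```

In practice the phenotypes would likely need to be scaled before computing a distance, since days and weight are in different units; that step is omitted here.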

In a second variant, the target causal variable value set is determined by optimizing a causal variable value set to minimize loss between the predicted phenotype value set (determined using the updated phenotype-variable association model) and the target phenotype value set. For example, the optimization can be performed using one or more causal variable value set seeds, where each seeded causal variable value set can be: observed, randomly determined, and/or be otherwise determined. In a third variant, the target causal variable value set is determined using Bayesian optimization, wherein an acquisition function interrogates the phenotype model to determine the target causal variable value set and/or the parents that could generate the target causal variable value set.
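The seeded optimization of the second variant can be sketched as follows. The description does not name an optimizer, so a greedy coordinate search is swapped in here purely for illustration; `predict` is a hypothetical stand-in for the phenotype model, squared error is assumed as the loss, and the seed, step size, and target are made-up values.

```python
def optimize(predict, seed, target, step=1.0, iterations=100):
    """Greedy coordinate search that lowers squared loss to the target phenotype value."""
    current = list(seed)
    best_loss = (predict(current) - target) ** 2
    for _ in range(iterations):
        improved = False
        for j in range(len(current)):
            for delta in (step, -step):
                trial = current[:]
                trial[j] += delta
                loss = (predict(trial) - target) ** 2
                if loss < best_loss:
                    # Accept any single-variable change that reduces the loss.
                    current, best_loss, improved = trial, loss, True
        if not improved:
            break  # stop condition: no further improvement from this point
    return current, best_loss


def predict(values):
    # Hypothetical stand-in for the phenotype model.
    return 3.0 * values[0] + 1.0 * values[1]


best_values, best_loss = optimize(predict, seed=[0.0, 0.0], target=10.0)
print(best_values, best_loss)
```

Running the search from several seeds and keeping the lowest-loss result is one way to reduce sensitivity to the starting point.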

However, the target causal variable value set can be otherwise determined.

Determining breeding parameters to achieve the target causal variable value set S750 functions to determine the set of parent organisms, the environmental conditions, and/or other breeding parameters to breed an organism with the target set of phenotypes. In variants, S750 can determine steps to reach the target causal variable value set and/or target phenotype from an initial causal variable value set. S750 can be performed after S710, after S730 (e.g., based on the target causal variable value set determined in S730), before and/or after any of S100-S600, and/or at any other suitable time. S750 can be iteratively performed, performed once, and/or performed at any other suitable time.

The breeding parameters can include breeding sets (e.g., one or more organisms in the population to cross-breed to achieve the target causal variable value set), a number of breeding generations, treatments (e.g., irradiation, siRNA gene silencing, nutrient application, etc.), growing conditions, and/or any other methods to transform an initial causal variable value set to a target causal variable value set. The breeding parameters preferably exclude genetic engineering (e.g., using CRISPR, foreign gene insertion, etc.), but can alternatively include genetic engineering.

In a first variant, the breeding parameters can be determined using predictive breeding methods. The predictive breeding methods can determine steps to breed one or more organisms, each associated with an observed causal variable value set, to achieve a target organism associated with the target causal variable value set. The one or more organisms can optionally be selected from a larger set of organisms (e.g., existing organisms currently available for breeding; parents). The selected organisms can: have the closest causal variable values to the target causal variable values, have a subset of causal variable values that match the target causal variable values, and/or be otherwise selected.

In a first example, organism sets (e.g., pairs, triplets, quads, etc.) can be selected (e.g., from a set of existing organisms, from a selected set of organisms, etc.) for breeding to achieve genotype values in a target genomic component variable value set.

In a second example, determining the breeding parameters (e.g., parent organism set) using predictive breeding can include: predicting a set of descendants for each set of parent organisms (e.g., each parent combination); predicting the set of phenotype values for each of the set of descendants (e.g., for each parent organism set); and selecting the parent organism set(s) that produce descendants with phenotype values satisfying a set of conditions (e.g., examples shown in FIG. 25 and FIG. 26).
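The three steps of this example (predict descendants, predict their phenotype values, select the parent set) can be sketched as follows. This is a deliberately simplified sketch: the virtual cross assumes each causal variable value is inherited whole from one parent with equal probability and ignores ploidy and linkage, `predict` is a hypothetical stand-in for the phenotype model, and the selection condition is assumed to be the largest fraction of children at or above a threshold. All names and values are made up.

```python
from itertools import combinations, product


def virtual_children(parent_a, parent_b):
    """Enumerate children in which each causal variable value comes from one parent."""
    return [list(child) for child in product(*zip(parent_a, parent_b))]


def best_parent_pair(parents, predict, threshold):
    """Score each parent pair by the fraction of children meeting the threshold."""
    scores = {}
    for (name_a, a), (name_b, b) in combinations(parents.items(), 2):
        children = virtual_children(a, b)
        hits = sum(1 for child in children if predict(child) >= threshold)
        scores[(name_a, name_b)] = hits / len(children)
    return max(scores, key=scores.get), scores


def predict(values):
    # Hypothetical stand-in for the phenotype model.
    return 2.0 * values[0] + 1.0 * values[1] + 0.5 * values[2]


# Made-up causal variable value sets for three candidate parents.
parents = {"p1": [1, 1, 0], "p2": [0, 1, 1], "p3": [1, 1, 1]}

pair, scores = best_parent_pair(parents, predict, threshold=3.0)
print(pair, scores)
```

Extending the sketch to several generations would mean feeding the virtual children back in as parents, which grows combinatorially and is omitted here.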

The descendants can include one or more generations. The descendants can be predicted using S731 and/or otherwise determined.

The phenotype values (e.g., traits) can be determined based on the phenotype model and the descendant's causal variable values, be determined using S733, and/or otherwise determined. The phenotype values can be predicted using the same or different phenotype model for different generations. For example, new phenotype models with new causal variables can be determined for different generations (e.g., using one or more of S100-S700), since the causal variables contributing to a phenotype can vary across generations.

The parent organism set can be selected based on the descendants' phenotype values, the distribution of the descendants' phenotype values (e.g., example shown in FIG. 26), and/or otherwise selected. Examples of selection conditions can include selecting the parent sets that: arrive at descendants with the target phenotype values the fastest (e.g., in the least number of generations); produce descendants with the best values (e.g., highest, lowest, etc.) for all or a threshold proportion of phenotypes; produce descendants with the optimal values for all or a threshold proportion of phenotypes; produce descendants with the most and/or least descendants with phenotype values over a set of threshold values (e.g., manually-determined threshold values, the mean phenotype value, median phenotype value, learned phenotype value, etc.); produce descendants with the most and/or least children with phenotype values of a set of threshold values; produce descendants with the most stable phenotype values across generations (e.g., the harvest weight is consistently above a threshold value for the most number of generations and/or more than a threshold number of generations, etc.); produce descendants with the least stable phenotype values across a predetermined number of generations (e.g., the descendants have less-desirable phenotype values after a threshold number of generations); have the highest probability of producing a descendant within a predetermined generation with a set of target phenotype values; the cost to produce the descendants with the target phenotype value(s) (e.g., calculated from the number of generations, the cost to grow each generation, the opportunity cost lost while waiting for the generations to mature, the probability of occurrence, etc.); and/or satisfy any other suitable set of conditions. However, the parent organism set can be selected based on an optimization (e.g., over cost, time, probability of producing a descendant with target phenotype values, etc.), manually selected, automatically selected (e.g., based on satisfaction of one or more of the aforementioned conditions), and/or otherwise determined.
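In an illustrative, non-limiting sketch, one of the selection conditions above (reaching the target phenotype value in the fewest generations) could be evaluated as follows. The function names, the data layout (per-generation lists of predicted descendant phenotype values for each candidate parent set), the `min_fraction` parameter, and the tie-break on mean phenotype value are all assumptions made for illustration and are not specified by this description.

```python
from statistics import mean


def generations_to_target(descendant_values, target, min_fraction=0.5):
    """Return the 1-based index of the first generation in which at least
    min_fraction of the predicted descendants meet the target value,
    or None if no generation does."""
    for generation, values in enumerate(descendant_values, start=1):
        hits = sum(1 for value in values if value >= target)
        if values and hits / len(values) >= min_fraction:
            return generation
    return None


def select_parent_set(candidates, target, min_fraction=0.5):
    """Pick the parent set that reaches the target in the fewest generations.

    candidates maps a parent-set label to a list of generations, each a list
    of predicted descendant phenotype values (hypothetical layout).
    Ties are broken by the higher mean value in the qualifying generation.
    """
    best_label, best_key = None, None
    for label, per_generation in candidates.items():
        generations = generations_to_target(per_generation, target, min_fraction)
        if generations is None:
            continue  # this parent set never reaches the target
        key = (generations, -mean(per_generation[generations - 1]))
        if best_key is None or key < best_key:
            best_label, best_key = label, key
    return best_label


# Hypothetical predicted harvest weights for two candidate parent sets.
candidates = {
    ("A", "B"): [[10, 12, 11], [13, 15, 14]],
    ("A", "C"): [[14, 15, 9], [16, 17, 15]],
}
selected = select_parent_set(candidates, target=14)
```

In this toy data, parent set ("A", "C") meets the target in the first generation, while ("A", "B") needs two generations. Other conditions listed above (e.g., cost or probability of occurrence) could be folded into the sort key in the same manner.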

However, the breeding parameters can be otherwise determined based on predictive breeding.

In a second variant, determining breeding parameters includes determining a treatment of an organism (e.g., applied at a given growth stage) that will alter one or more observed variable values to bring an initial causal variable value set closer to the target causal variable value set. The treatment can be determined using known effects of the treatment (e.g., known methylation effects), simulations of treatment at a growth stage, and/or using any other information associated with the treatment and/or the organism. This can be determined in combination with the first variant (e.g., wherein the phenotype values can be determined based on a combination of the variable values from predictive breeding and treatment values) and/or independently from the first variant. The treatment values can be: predicted (e.g., using the phenotype model), manually specified, randomly determined, and/or otherwise determined. In a first example, a treatment of an organism can be determined to increase and/or decrease methylation of one or more genes (e.g., to alter causal DNA methylation variable values, to alter causal genomic expression variable values, etc.). In a second example, a gene therapy can be determined to increase and/or decrease gene expression for one or more genes (e.g., to alter causal genomic expression variable values). In a third example, genetic modification steps can be determined to modify an organism's genome (e.g., to alter causal genomic component variable values). Examples of treatments can include: irradiation, siRNA gene silencing, nutrient application, and/or other treatments. In a fourth example, the environmental parameter values are predetermined and dictated by the growing environment. In a related example, the treatment values can be determined based on other causal variable values (e.g., the genomics of the organism being grown, historic environmental conditions, etc.). In an illustrative example, a treatment amount and frequency (e.g., watering, fertilization, etc.) can be calculated given the phenotype model, the genome of the planted organism (e.g., genomic variable values), the measured environmental conditions (e.g., environmental variable values; soil conditions, nitrogen concentration, etc.), and the desired phenotype value(s). In a fifth example, the treatment parameters can be determined from a user's treatment practice (e.g., a farmer's fertilization schedule). However, the treatment parameters can be otherwise determined.
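As a minimal sketch of the illustrative treatment calculation above, assume (purely for illustration) that the phenotype model is linear in its causal variables, so that the treatment amount can be solved for directly once the genomic and environmental variable values are fixed. The linear form, the variable names, and the coefficient values below are hypothetical; the phenotype model described herein is not limited to this form.

```python
def solve_treatment_amount(coefficients, intercept, fixed_values,
                           treatment_name, target_phenotype):
    """Solve a hypothetical linear phenotype model for one treatment variable.

    Assumed model: phenotype = intercept + sum(coefficient * value) over all
    causal variables, where every variable except the treatment is fixed.
    """
    treatment_coefficient = coefficients[treatment_name]
    if treatment_coefficient == 0:
        raise ValueError("treatment has no modeled effect on the phenotype")
    # Contribution of the fixed genomic and environmental variables.
    baseline = intercept + sum(
        coefficients[name] * value for name, value in fixed_values.items()
    )
    return (target_phenotype - baseline) / treatment_coefficient


# Hypothetical coefficients and observed values.
coefficients = {"snp_1": 2.0, "soil_nitrogen": 0.5, "fertilizer": 0.25}
fixed_values = {"snp_1": 1.0, "soil_nitrogen": 4.0}

amount = solve_treatment_amount(
    coefficients,
    intercept=10.0,
    fixed_values=fixed_values,
    treatment_name="fertilizer",
    target_phenotype=16.0,
)
```

With these toy numbers the fixed variables contribute a baseline of 14.0, so a fertilizer amount of 8.0 units closes the remaining gap to the desired phenotype value. A nonlinear phenotype model would instead call for a numerical search over candidate treatment values.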

In a third variant, the breeding parameters include values extracted from the target causal variable value set. In an illustrative example, if the target causal variable value set includes an environmental variable value of 70° F., the temperature to grow the organism in order to achieve this target causal variable value set should be 70° F.

In a fourth variant, the breeding parameters can be determined experimentally. In this variant, the method can include: growing an organism, determining the causal variable values for the organism, predicting the phenotype value based on the causal variable values using the phenotype model, and selecting the organism (and/or the parents of the organism) based on the phenotype value. The organism can be bred using the methods discussed above, randomly bred, and/or otherwise grown. Selecting the organism can include: not killing the organism (e.g., not weeding the organism), treating the organism (e.g., fertilizing the organism, replanting the organism, etc.), and/or otherwise selecting the organism. In variants, this can be performed in real- or near-real time (e.g., while a treatment mechanism is passing over the plant bed) or asynchronously with organism treatment.
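The experimental selection of the fourth variant can be sketched as a simple pass over grown organisms. In this non-limiting sketch, `toy_phenotype_model`, its weights, the organism identifiers, and the fixed selection threshold are all hypothetical stand-ins; the actual phenotype model and selection criterion can be otherwise determined.

```python
def toy_phenotype_model(causal_values):
    """Hypothetical stand-in for the phenotype model: a weighted sum."""
    weights = {"snp_1": 2.0, "soil_moisture": 0.5}
    return sum(weights[name] * causal_values[name] for name in weights)


def select_organisms(organisms, phenotype_model, threshold):
    """Split organisms into kept and culled lists by predicted phenotype value.

    organisms maps an organism identifier to its causal variable values.
    """
    kept, culled = [], []
    for organism_id, causal_values in organisms.items():
        predicted = phenotype_model(causal_values)
        if predicted >= threshold:
            kept.append(organism_id)    # e.g., not weeded, fertilized
        else:
            culled.append(organism_id)  # e.g., weeded
    return kept, culled


# Hypothetical causal variable values measured for three plants.
plants = {
    "plant_1": {"snp_1": 1.0, "soil_moisture": 4.0},
    "plant_2": {"snp_1": 0.0, "soil_moisture": 2.0},
    "plant_3": {"snp_1": 2.0, "soil_moisture": 6.0},
}
kept, culled = select_organisms(plants, toy_phenotype_model, threshold=3.0)
```

Because each organism is evaluated independently, the same loop body could run per plant in real or near-real time as measurements arrive.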

However, the breeding parameters can be otherwise determined.

Optionally, a target causal variable value set can be selected (e.g., in S730) based on the breeding parameters. For example, candidate causal variable value sets can be weighted during selection in S735 based on their respective breeding parameters. In an illustrative example, a candidate causal variable value set that requires more breeding generations is weighted lower than a candidate causal variable value set that requires fewer breeding generations. In another illustrative example, a candidate causal variable value set that prescribes more expensive treatments (e.g., more overall nitrogen, more nitrogen applications, etc.) or growing conditions (e.g., a tight temperature range, a tight moisture range, etc.) can be weighted lower than those prescribing less expensive treatments or growing conditions. The method can optionally include breeding organisms in the population based on the breeding parameters to generate a new organism (e.g., with the target phenotype, with the target causal variable value set, etc.).
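One hedged way to realize the weighting described above is to scale each candidate's predicted phenotype value by a weight that shrinks as the required number of breeding generations and the treatment cost grow. The weighting formula, the penalty parameters, and the candidate field names below are assumptions chosen for illustration only; any monotonically decreasing weighting could be substituted.

```python
def breeding_weight(generations, treatment_cost,
                    generation_penalty=0.1, cost_penalty=0.01):
    """Hypothetical weight in (0, 1]: lower for more generations or cost."""
    return 1.0 / (1.0 + generation_penalty * generations
                  + cost_penalty * treatment_cost)


def select_candidate(candidates):
    """Return the label of the candidate with the highest weighted score.

    candidates maps a label to a dict with the predicted phenotype value,
    the number of breeding generations, and the treatment cost.
    """
    def weighted_score(item):
        _, candidate = item
        weight = breeding_weight(candidate["generations"],
                                 candidate["treatment_cost"])
        return candidate["predicted_phenotype"] * weight

    label, _ = max(candidates.items(), key=weighted_score)
    return label


# Hypothetical candidate causal variable value sets.
candidates = {
    "set_a": {"predicted_phenotype": 100.0, "generations": 4,
              "treatment_cost": 10.0},
    "set_b": {"predicted_phenotype": 90.0, "generations": 1,
              "treatment_cost": 5.0},
}
selected = select_candidate(candidates)
```

Here "set_a" has the higher predicted phenotype value, but its four generations and higher cost reduce its weighted score below that of "set_b", which is therefore selected.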

However, breeding parameters can be otherwise determined.

In an example, the method includes: observing values for a plurality of variables (e.g., genotype, environment, gene expression, DNA methylation, transcriptome, etc.) and observing trait values (e.g., phenotypes) for each organism in a population; generating a first model based on the observed trait values and the observed variable values; and identifying causal variables for a trait from the plurality of variables using the first model. In another example, the method includes: identifying causal variables for a phenotype from the plurality of variables using a phenotype-variable association model; generating a phenotype model relating the causal variables (e.g., only the causal variables) with the phenotype; determining multiple candidate causal variable value sets (e.g., using predictive breeding, by permuting one or more causal variable values); predicting a phenotype value for each candidate causal variable set using the phenotype model; selecting a candidate causal variable value set and/or the associated parent set based on the respective predicted phenotype value; and optionally determining breeding parameter values, such as a series of breeding sets (e.g., breeding pairs), growing conditions, and/or treatments to generate an organism with the selected candidate causal variable value set. In variants, the causal variables can be those having a nonzero (e.g., positive and/or negative) association metric, wherein the association metric can be determined from a difference between an output of the first model (e.g., coefficient, predicted phenotype value, etc.) given observed variable values and an output of the first model given test values for the variable being tested. In variants, the first model can ignore inter-variable interaction effects, while the phenotype model can account for inter-variable interaction effects.
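The association metric described above can be sketched, under stated assumptions, as the mean signed difference between the first model's predicted phenotype value given the observed variable values and its predicted value given test values for the variable being tested. The toy linear model, the variable names, the test values, and the numerical tolerance used to decide "nonzero" are hypothetical; how the test values are generated is described elsewhere herein and is not reproduced in this sketch.

```python
def association_metric(model, observed_rows, variable, test_values):
    """Mean signed difference between model outputs on observed values and
    on copies in which one variable is replaced by its test value."""
    differences = []
    for row, test_value in zip(observed_rows, test_values):
        test_row = dict(row)  # copy so the observed values are untouched
        test_row[variable] = test_value
        differences.append(model(row) - model(test_row))
    return sum(differences) / len(differences)


def identify_causal_variables(model, observed_rows, test_values_by_variable,
                              tolerance=1e-9):
    """Return variables whose association metric is nonzero within tolerance."""
    return [
        variable
        for variable, test_values in test_values_by_variable.items()
        if abs(association_metric(model, observed_rows, variable,
                                  test_values)) > tolerance
    ]


def toy_first_model(row):
    """Hypothetical first model: the phenotype depends on x1 but not x2."""
    return 3.0 * row["x1"] + 0.0 * row["x2"]


observed_rows = [{"x1": 1.0, "x2": 0.0}, {"x1": 2.0, "x2": 1.0}]
test_values_by_variable = {"x1": [0.0, 0.0], "x2": [1.0, 0.0]}

metric_x1 = association_metric(toy_first_model, observed_rows, "x1",
                               test_values_by_variable["x1"])
causal = identify_causal_variables(toy_first_model, observed_rows,
                                   test_values_by_variable)
```

In this toy case, replacing x1 changes the model output while replacing x2 does not, so only x1 is identified. A signed mean can cancel when effects differ in sign across organisms; a mean absolute difference is one alternative if that is a concern.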

In a first specific example, the goal is to breed an organism that best expresses a phenotype within a given environment. In this example, the environmental variables can be fixed in the phenotype model, and the first model can include or exclude the environmental variables.

In a second specific example, the goal can be to grow an organism that best expresses a phenotype. In this example, the environmental variables can be adjustable, and be accounted for in both the first and phenotype models.

However, S700 can be otherwise performed.

Alternative embodiments implement the above methods and/or processing modules in non-transitory computer-readable media, storing computer-readable instructions that, when executed by a processing system, cause the processing system to perform the method(s) discussed herein. The instructions can be executed by computer-executable components integrated with the computer-readable medium and/or processing system. The computer-readable medium may include any suitable computer readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, non-transitory computer readable media, or any suitable device. The computer-executable component can include a computing system and/or processing system (e.g., including one or more collocated or distributed, remote or local processors) connected to the non-transitory computer-readable medium, such as CPUs, GPUs, TPUs, microprocessors, or ASICs, but the instructions can alternatively or additionally be executed by any suitable dedicated hardware device.

Embodiments of the system and/or method can include every combination and permutation of the various system components and the various method processes, wherein one or more instances of the method and/or processes described herein can be performed asynchronously (e.g., sequentially), contemporaneously (e.g., concurrently, in parallel, etc.), or in any other suitable order by and/or using one or more instances of the systems, elements, and/or entities described herein. Components and/or processes of the following system and/or method can be used with, in addition to, in lieu of, or otherwise integrated with all or a portion of the systems and/or methods disclosed in the applications mentioned above, each of which is incorporated in its entirety by this reference.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims.

We claim:
 1. A method, comprising: a) for each of a set of organisms: determining a trait value for a trait; and determining variable values for a set of variables; b) selecting a subset of variables from the set of variables; and c) determining a first model configured to predict values for variables of interest in the set of variables based on values for the subset of variables; d) determining test variable values for the variables of interest using the first model; e) using the test variable values, determining a second model comprising a relationship between the set of variables and the trait; and f) identifying causal variables from the set of variables based on the second model.
 2. The method of claim 1, further comprising training the first model using values for a second set of variables, different from the subset of variables.
 3. The method of claim 2, wherein the first model comprises an autoencoder.
 4. The method of claim 1, wherein determining the first model comprises fitting a linear regression based on the variable values for the set of variables.
 5. The method of claim 1, further comprising clustering the set of variables based on autocorrelation analysis of the variable values, wherein the subset of variables and variables of interest are from a shared cluster.
 6. The method of claim 5, wherein variables in the set of variables comprise k-mers.
 7. The method of claim 1, wherein (b)-(c) are iteratively repeated until a model fit metric for the first model rises above a threshold.
 8. The method of claim 7, further comprising, in a first iteration, segmenting the subset of variables into high importance variables and low importance variables based on the first model, wherein, in a second iteration, selecting the subset of variables comprises replacing the low importance variables.
 9. The method of claim 1, wherein a size of the subset of variables is less than a size of the set of organisms.
 10. The method of claim 1, further comprising: determining target values for the causal variables based on a target trait; and based on the target values, breeding organisms in the set of organisms to generate a new organism with the target trait.
 11. The method of claim 1, wherein variables in the set of variables comprise variables for at least one of: loci, gene expression, protein expression, methylation, environmental parameters, or protein binding affinity.
 12. A method, comprising: for each of a set of organisms: determining a trait value for a trait; and determining variable values associated with the trait value for a set of variables; transforming values for a subset of variables to a reduced dimension space; determining transformed test values for a variable of interest based on the transformed values for the subset of variables; using the transformed values for the subset of variables and the transformed test values, determining a second model comprising a relationship between transformed variables and the trait; identifying transformed causal variables based on the second model; and decoding the transformed causal variables to determine causal variables.
 13. The method of claim 12, wherein the reduced dimension space comprises a latent space, wherein transforming values for the subset of variables comprises training an autoencoder to encode variable values to the latent space, wherein the trained autoencoder is used to decode the transformed causal variables.
 14. The method of claim 13, wherein the autoencoder is trained using values for a second subset of variables, different from the subset of variables.
 15. The method of claim 12, further comprising determining a distribution of the subset of transformed variables, wherein determining transformed test values for the variable of interest comprises selecting transformed test values from the distribution.
 16. The method of claim 12, wherein transforming values for a subset of variables comprises: training a neural network to predict a trait value based on the respective variable values; and using a first layer of the trained neural network to transform the values for the subset of variables.
 17. The method of claim 12, further comprising selecting organisms from the set of organisms for cross-breeding based on the causal variables and a target trait value.
 18. The method of claim 12, wherein the set of variables comprise genomic variables, wherein the subset of variables in the reduced dimension space comprises a set of features, wherein the transformed values comprise feature values, wherein the variable of interest comprises a feature of interest, wherein the transformed test values comprise test feature values for the feature of interest, wherein the second model comprises a relationship between the feature of interest, the set of features, and the trait, wherein the transformed causal variables comprise causal features of interest, and wherein decoding the transformed causal variables comprises decoding the causal features of interest into causal genomic variables.